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cn 1 Introduction 

<N 

The notion of regularization has been widely used as a tool to address a 
number of problems that are usually encountered in Machine Learning. Im- 
proving the performance of an estimator by shrinking the norm of the MVU 
i— l estimator, guarding against overfitting, coping with ill-conditioning, provid- 

ing a solution to an underdetermined set of equations, are some notable 
examples where regularization has provided successful answers. A notable 
example is the ridge regression concept, where the LS loss function is com- 
bined, in a tradeoff rationale, with the Euclidean norm of the desired solu- 
tion. 

In this paper, our interest will be on alternatives to the Euclidean norms 
and in particular the focus will revolve around the t\ norm; this is the sum of 
the absolute values of the components comprising a vector. Although seeking 
a solution to a problem via the t\ norm regularization of a loss function has 
been known and used since the 1970s, it is only recently that has become 
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the focus of attention of a massive volume of research in the context of 
compressed sensing. At the heart of this problem lies an underdetermined 
set of linear equations, which, in general, accepts an infinite number of 
solutions. However, in a number of cases, an extra piece of information is 
available: the true model, whose estimate we want to obtain, is sparse; that 
is, only a few of its coordinates are nonzero. It turns out that a large number 
of commonly used applications can be cast under such a scenario and can 
be benefited by a so-called sparse modeling. 

Besides its practical significance, sparsity-aware processing has offered to 
the scientific community novel theoretical tools and solutions to problems 
that only a few years ago seemed to be intractable. This is also a reason that 
this is an interdisciplinary field of research encompassing scientists from, e.g., 
mathematics, statistics, machine learning, signal processing. Moreover, it 
has already been applied in many areas ranging from biomedicine, to com- 
munications and astronomy. At the time this paper is compiled, there is a 
"research happening" in this field, which poses some difficulties in assem- 
bling related material together. We have made an effort to put together, in 
a unifying way, the basic notions and ideas that run across this new field. 
Our goal is to provide the reader with an overview of the major contribu- 
tions which took place in the theoretical and algorithmic fronts and have 
been consolidated over the last decade or so. Besides the methods and algo- 
rithms which are reviewed in this article, there is another path of methods 
based on the Bayesian learning rationale. Such techniques will be reviewed 
elsewhere. 

2 Parameter Estimation 

Parameter estimation is at the heart of what is known as Machine Learning; 
a term that is used more and more as an umbrella for a number of scientific 
topics that have evolved over the years within different communities, such 
as Signal Processing, Statistical Learning, Estimation/Detection, Control, 
Neurosciences, Statistical Physics, to name but a few. 

In its more general and formal setting, the parameter estimation task 
is defined as follows. Given a set of data points (y n ,x n ), y n G 1Z, x n G 
1Z l , , n = 1, 2, . . . , N, known as the training data, and a parametric set of 
functions 

F := {fo, 0GiC K k }, 

find a function in J 7 , which will be denoted as /(•) := feS')-> such that given 
the value of x G TZ l , f(x) best approximates the corresponding value of 
y G 1Z. After all, the main goal of Machine Learning is prediction. In a more 
general setting, y can also be a vector y G 7Z m . Most of our discussion here 
will be limited to real valued variables. Obviously, extensions to complex 
valued data are readily available. 
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Figure 1: Block diagram showing the input-output relation in a regression 
model. 

Having adopted the parametric set of functions and given the the training 
data set, the goal becomes that of estimating the values of the parameters 
6 so that to "fit" the data in some (optimal) way. There are various paths 
to achieve this goal. In this paper, our approach comprises the adoption of 
a loss function 

£(•,•) : K x TZ\ — ► [0,oo), 

and obtain such that 

0* := argmin J (6), 

where 

N 

J(e):=^2c(y n ,f e (x)). (1) 

n=l 

In this review article, the focus will be on the Least Squares loss function, 
i.e., 

C(y,fg(x)) := (y-f e ( x )) 2 . 

Among the many parametric models, regression covers a large class of 
Machine Learning tasks. In linear regression, one models the relationship of 
a dependent variable y, which is considered as the output of a system, with a 
set independent variables, xi, x 2 , ■ ■ ■ , xi, which are thought as the respective 
inputs that activate the system in the presence of a noise (unobserved) 
disturbance, 77, i.e., 

y = Qixi + . . . + Q\xi + # + f), 

where 9q is known as the bias or intercept, see Figure [T] Very often the 
previous input-output relationship is written as 

y = X T 6 + 7] (2) 

where 

6 := [e 1 ,...,0 o ] T , and x := [x 1: . . . ,x h lf . (3) 

Often, x is called the regressor. Given the set of training data points, 
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(y n ,x n ), n = 1,2, 



, N, ([2]) can compactly written as 
y = X6 + rj 



where 



X := 



r t 1 




yi 




m 


T 

_ x N _ 


, y = 




, v = 













(4) 



(5) 



For such a model, the Least Squares cost function becomes 



N 



J{6) = 



G T x n f 



y 



xe\ 



(6) 



n=l 



where || • || denotes the Euclidean norm. Minimizing ^ with respect to 6 
results to the celebrated LS estimate 



e LS = {x T xy x x T y , 



(7) 



assuming the the matrix inversion is possible. However, for many practical 
cases, the cost function in ^ is augmented with a so called regularization 
term. There are a number of reasons that justify the use of a regulariza- 
tion term. Guarding against overfitting, purposely introducing bias in the 
estimator in order to improve the overall performance, dealing with the ill 
conditioning of the task are examples in which the use of regularization ad- 
dresses successfully. Ridge regression is a celebrated example, where the cost 
function is augmented as 



J{6) 

leading to the estimate 



|y-X0|| 2 + A||0|| 2 , A>0 



Or = (X T X + XI)- 1 X T y, 

where I is the identity matrix. 

The major goal of this review article is to focus at alternative norms in 
place of the Euclidean norm, which was employed in ridge regression. As 
we will see, there are many good reasons in doing that. 



3 Searching for a Norm 

Mathematicians have been very imaginative in proposing various norms in 
order to equip linear spaces. Among the most popular norms used in func- 
tional analysis are the so-called £ p norms. To tailor things to our needs, 
given a vector £ its £ p norm is defined as 
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For p = 2, the Euclidean or £2 norm is obtained, and for p = 1, Q results 
in the £\ norm, i.e., 

I 

l|0|li = £|0i|- (9) 

i=l 

If we let p — > 00, then we get the £oo norm; let |#i max | := max {|#i|, |6>2|, • • • , 
and notice that 



(10) 



that is, || ^ II 00 * s ec l ua l to the maximum of the absolute values of the coor- 
dinates of 9. One can show that all the £ p norms are true norms for p > 1; 
they satisfy all four requirements that a function Tc — > [0, 00) must respect 
in order to be called a norm, i.e., 

1. l|0|l P >o. 

2. ||0|| p = 0^0 = 0. 

3. ||a0|| p = |q| ||0|| p , Ma e K. 

4. ||6>i + 6> 2 || p < ||6»i|| p + ||0 2 || p . 

The third condition enforces the norm function to be {positively) homoge- 
neous and the fourth one is the triangle inequality. These properties also 
guarantee that any function that is a norm is also a convex one. Although 
strictly speaking, if we allow p > to take values less than one in Q, the 
resulting function is easily shown not to be a true norm, we can still call 
them norms, albeit knowing that this is an abuse of the definition of a norm. 
An interesting case, which will be used extensively in this paper, is the £q 
norm, which can be obtained as the limit, for p — > 0, of 

l l 



iieiio : = = % E i^i p = E ^(N), (ii) 

i=l i=l 

where xa{') is the characteristic function with respect to a set A, defined as 

xa(t) ■= 



1, if re A 
0, if r $ A. 



That is, the £q norm is equal to the number of nonzero components of the 
respective vector. It is very easy to check that this function is not a true 
norm. Indeed, this function is not homogeneous, i.e., ||a0|| o 7^ |a| ||0|| o , 
Va 7^ 1. Fig. [2] shows the isovalue curves, in the two-dimensional space, that 
correspond to ||0|L = p = 1, for p = 0, 0.5, 1, 2, and 00. Observe that for the 
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Figure 2: The isovalue curves for \\6\\ p = 1 and for various values of p, in 
the two dimensional space. Observe that for the £q norm, the respective 
values cover the two axes with the exception of the point (0, 0). For the £\ 
norm the isovalue curve is a rhombus and for the £2 (Euclidean) norm, it is 
a circle. 

4 

3 

\e\ p 2 
1 
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6 

Figure 3: Observe that the epigraph, that is, the region above the graph, 
is nonconvex for values p < 1, indicating the nonconvexity of the respective 
I • \ p function. The value p = 1 is the smallest one for which convexity is 
retained. Also note that, for large values of p > 1, the contribution of small 
values of 9 to the respective norm becomes insignificant. 

Euclidean norm the isovalue curve has the shape of a "ball" and for the £\ 
norm the shape of a rhombus. We refer to them as the £2 and the l\ balls, 
respectively, by slightly "abusing" the meaning of a balQ Observe that in 
the case of the £q norm, the isovalue curve comprises both the horizontal and 
the vertical axes, excluding the (0, 0) element. If we restrict the size of the 
£q norm to be less than one, then the corresponding set of points becomes 
a singleton, i.e., (0,0). Also, the set of all the points that have £0 norm less 
than or equal to two, is the 1Z 2 space. This, slightly "strange" behavior, is 
a consequence of the discrete nature of this "norm" . 




\p = 4 






= 2 / / ■ 




-P= l / / 




/ p = 0.5 I / ^ 









1 Strictly speaking, a ball must also contain all the points in the interior. 
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Fig. [3] shows the graph of | • | p , which is the individual contribution of 
each component of a vector to the £ p norm, for different values of p. Observe 
that a) for p < 1, the region which is formed above the graph (known as 
epigraph) is not a convex one, which verifies what we have already said; i.e, 
the respective function is not a true norm, b) for values of the argument 
\8\ > 1, the larger the value of p > 1 and the larger the value of \9\ the 
higher its respective contribution to the norm. Hence, if £ p norms, p > 1, 
are used to regularize a loss function, such large values become the dominant 
ones and the optimization algorithm will concentrate on these by penalizing 
them to get smaller, so that the overall cost to be reduced. On the other 
hand, for values of the argument \9\ < 1 and closer to zero, the l\ norm is 
the only one (among p > 1) that retains relatively large values and, hence, 
the respective components can still have a say in the optimization process 
and can be penalized by being pushed to smaller values. Hence, if the l\ 
norm is used to replace the £2 one in the regularization equation, only those 
components of the vector, that are really significant in reducing the model 
misfit measuring term in the regularized cost function, will be kept and the 
rest will be forced to zero. The same tendency, yet more aggressive, is true 
for < p < 1. The extreme case is when one considers the £q norm. Even a 
small increase of a component from zero, makes its contribution to the norm 
large, so the optimizing algorithm has to be very "cautious" in making an 
element nonzero. 

From all the true norms (p > 1), the l\ is the only one that shows respect 
to small values. The rest of the t v norms, p > 1, just squeeze them, to make 
their values even smaller and care, mainly, for the large values. We will 
return to this point very soon. 



4 The Least Absolute Shrinkage and Selection Op- 
erator (LASSO) 

We have already discussed some of the benefits in adopting the regularization 
method for enhancing the performance of an estimator. However, in this 
paper, we are going to see and study more reasons that justify the use of 
regularization. The first one refers to what is known as the interpretation 
power of an estimator. For example, in the regression task, we want to 
select those components, 9i, of 6 that have the most important say in the 
formation of the output variable. This is very important if the number of 
parameters, I, is large and we want to concentrate on the most important 



of them. In a classification task |Theo 09 , not all features are informative, 



hence one would like to keep the most informative of them and make the less 
informative ones equal to zero. Another related problem refers to those cases 
where we know, a-priori, that a number of the components of a parameter 
vector are zero but we do not know which ones. The discussion we had 
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at the end of the previous section starts now to become more meaningful. 
Can we use, while regularizing, an appropriate norm that can assist the 
optimization process a) in unveiling such zeros or b) to put more emphasis 
on the most significant of its components, those that play a decisive role in 
reducing the misfit measuring term in the regularized cost function, and set 
the rest of them equal to zero? Although the £ p norms, with p < 1, seem 
to be the natural choice for such a regularization, the fact that they are not 
convex makes the optimization process hard. The t\ norm is the one that is 
"closest" to them yet it retains the computationally attractive property of 
convexity. 

The t\ norm has been used for such problems for a long time. In the 



seventies, it was used in seismology Tayl 79 , Clae 73 , where the reflected 



signal, that indicates changes in the various earth substrates, is a sparse 
one, i.e., very few values are relatively large and the rest are small and 
insignificant. Since then, it has been used to tackle similar problems in 



different applications, e.g., [Sant 86 , Dono 92 . However, one can trace two 



papers that were really catalytic in providing the spark for the current strong 



Sant86|) , 



Chen 98], 



interest around the i\ norm. One came from statistics, Tibs 96 , which 
addressed the LASSO task (first formulated, to our knowledge, in 
to be discussed next, and one from the signal analysis community, 
which formulated the Basis Pursuit, to be discussed in a later section. 
We first address our familiar regression task 

y = Xe + rj, y, V €K N ,0£K l , 

and obtain the estimate of the unknown parameter 6 via the LS loss, regu- 
larized by the i\ norm, i.e., for A > 0, 

6 := argmin 0g7 ^i L(6, A) (12) 
:= argmin 0e7 ^ (^(yn - x T n ef + A \\9\\j\ 

= a rgmm 0En i((y-Xe) T (y-Xe) + X\\e\\ 1 ). (13) 

In order to simplify the analysis, we will assume hereafter, without harming 
generality, that the data are centered. If this is not the case, the data can 
be centered by subtracting the sample mean y from each one of the output 
values. The estimate of the bias term will be equal to the sample mean y. 



The task in ( 13 ) can be equivalently written in the following two formulations 



: min (y - X0) T (y - XO), 
oen 1 

s.t. ||0||i</£>, (14) 
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or 



min 

oen 1 



11 ' 



s.t. (y-XOf(y-XO) < e, 



(15) 



given the user-defined parameters p, e > 0. The formulation in (14) is known 



as the LASSO and the one in (15) as the Basis Pursuit Denoising (BPDN), 
e.g., [Bruc 09] . All three formulations can be shown to be equivalent for 
specific choices of A,e, and p. The minimized cost function in (13) corre- 



sponds to the Lagrangian of the formulations in (14) and (15). However 



this functional dependence is hard to compute, unless the columns of X are 
mutually orthogonal. Moreover, this equivalence does not necessarily imply 
that all three formulations are equally easy or difficult to solve. As we will 
see later on, algorithms have been developed along each one of the previous 
formulations. From now on, we will refer to all three formulations as the 
LASSO task, in a slight abuse of the standard terminology, and the specific 
formulation will be apparent from the context, if not stated explicitly. 

As it was discussed before, the Ridge regression admits a closed form 
solution, i.e, 



6 R = (X T X + XI) X 1 



y- 



In contrast, this is not the case for LASSO and its solution requires iterative 
techniques. It is straightforward to see that LASSO can be formulated as a 
standard convex quadratic problem with linear inequalities. Indeed, we can 



rewrite (13) as 



min {y - X0) T {y - XO) + X\^ 

{0i,ui} l i=1 " ' ^ 



8=1 



S.t. 



Ui < 6; < Ui 



Ui > 0, 



i — 1,2, ... ,1, 



which can be solved by any standard convex optimization method, e.g. 



[Ye 97l|Bqyd~04 



The reason that developing algorithms for the LASSO 
has been a hot research topic is due to the emphasis in obtaining efficient 
algorithms by exploiting the specific nature of this task, especially for cases 
where I is very large, as it is often the case in practice. 

In order to get a better insight of the nature of the solution that is ob- 
tained by LASSO, let us assume that the regressors are mutually orthogonal 
and of unit norm, hence X T X = I. Orthogonality of the input matrix helps 
to decouple the coordinates and results to I one-dimensional problems, that 
can be solved analytically. For this case, the LS estimator becomes 



e 



LS 



(X 1 X)- L X 1 y = X 1 y, 
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and the ridge regression gives 



e 



R 



l + A 



0LS, 



(16) 



that is, every component of the LS estimator is simply shrunk by the same 
factor, j^j. 

In the case of the l\ regularization, the minimized Lagrangian function 
is no more differentiable, due to the presence of the absolute values in the l\ 
norm. So, in this case, we have to consider the notion of the subdifferential 
(see Appendix). It is known that if the zero vector belongs to the subd- 
ifferential set of a convex function at a point, this means that this point 
corresponds to a minimum of the function. Taking the subdifferential of 



the Lagrangian defined in (13) and recalling that the subdifferential of a 



differentiable function includes only the respective gradient, we obtain that 



G -2X T y + 2X T X0 + Xd ||0| 



i • 



where d stands for the subdifferential operator (see Appendix). If X has 
orthonormal columns, the previous equation can be written component-wise 
as follows 

o g -e LS ,i + e u + -d 



Vi, 



(17) 



where the subdifferential of the function | • | , derived in Appendix, is given 
as 

'{!}, if0>O, 
{-!}, if0<O, 
U-1,1], if = 0. 
Thus, we can now write 



d\e\ 



_ A 

i ~ 

A 
2' 



if e lti > o, 
if §u < o. 



(18) 
(19) 



Notice that (18) can only be true if #ls,« > 2' an< ^ on ^ ^ ^ L S,« < — 2- 



Moreover, in the case where = 0, then (17) and the subdifferential of 
| • | suggest that necessarily #ls,? 
compact way that 



< ^. Concluding, we can write in a more 



h,i = sgn(& LS ,i 



#LS,i 



(20) 



where (•)+ denotes the "positive part" of the respective argument; it is 
equal to the argument if this is non-negative, and zero otherwise. This is 
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Figure 4: Output-input curves for the hard thresholding, soft threshold- 
ing operators together with the linear operator associated with the ridge 
regression, for the same value of A = 1. 

very interesting indeed. In contrast to the ridge regression that shrinks all 
coordinates of the unregularized LS solution by the same factor, LASSO 
forces all coordinates, whose absolute value is less than or equal to A/2, to 
zero, and the rest of the coordinates are reduced, in absolute value, by the 
same amount A/2. This is known as soft thresholding, to distinguish it from 
the hard thresholding operation; the latter is defined as 9 ■ X(o,oo) (1^1 — 
9 6 1Z, where X(o,oo)(") stands for the characteristic function with respect 
to the set (0,oo). Fig. [4] shows the graphs illustrating the effect that the 
ridge regression, LASSO and hard thresholding have on the unregularized LS 
solution, as a function of its value (horizontal axis). Note that our discussion 
here, simplified via the orthonormal input matrix case, has quantified what 
we had said before about the tendency of the i\ norm to push small values to 
become exactly zero. This will be further strengthened, via a more rigorous 
mathematical formulation, in Section [6j 

Example 1. Assume that the unregularized LS solution, for a given regres- 
sion task, y = X9 + rj, is given by: 

LS = [0.2,-0.7,0.8,-0.1,1.0] T 

Derive the solutions for the corresponding ridge regression and l\ norm 
regularization tasks. Assume that the input matrix X has orthonormal 
columns and that the regularization parameter is A = 1. Also, what is the 
result of hard thresholding the vector #ls with threshold equal to 0.5? 
We know that the corresponding solution for the ridge regression is 

Or = j^As = I - 1 ' -°- 35 ' °' 4 ' -°- 05 ' °- 5 ] T 

The solution for the i\ norm regularization is given by soft thresholding, 
with threshold equal to A/2 = 0.5, hence the corresponding vector is 




01 = [0,-0.2, 0.3,0, 0.5] T . 
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The result of the hard thresholding operation is the vector [0, —0.7, 0.8, 0, 1.0] r . 
Remarks 1. 

• The hard and soft thresholding rules are only two possibilities out of a 
larger number of alternatives. Note that the hard thresholding oper- 
ation is defined via a discontinuous function and this makes this rule 
to be unstable, in the sense of being very sensitive to small changes of 
the input. Moreover, this shrinking rule tends to exhibit large variance 
in the resulting estimates. The soft thresholding rule is a continuous 
function, but, as it is readily seen from the graph in Fig. [4| it intro- 
duces bias even for the large values of the input argument. In order 
to ameliorate such shortcomings, a number of alternative threshold- 
ing operators have been introduced and studied both theoretically and 
experimentally. Although these are not within the mainstream of our 
interest, we provide two popular examples for the sake of completeness; 
the Smoothly Clipped Absolute Deviation (SCAD): 



#SCAD 



f sgn(#) - A S cad)+ , \0\ < 2A S CAD, 

(q - 1)6* - aAscAD sgn(0) 



-, 2Ascad < \0\ < aAscAD, 



a - 2 

\9\ > aAscAD, 



and the nonnegative garrote thresholding rule : 

f0, |#|<A garr , 

[0 ^p> M > ^garr- 

Fig. [5] shows the respective graphs. Observe that, in both cases, an 
effort has been made to remove the discontinuity (associated with the 
hard thresholding) and to remove/reduce the bias for large values of 
the input argument. The parameter a is a user-defined one. For a 
more detailed discussion on this topic, the interested reader can refer, 
for example, to |Anto 07] . 



5 Sparse Signal Representation 

In the previous section, we brought into our discussion the need for taking 
special care for zeros. Sparsity is an attribute that is met in a plethora 
of natural signals, since nature tends to be parsimonious. In this section, 
we will briefly present a number of application cases, where the existence 
of zeros in a mathematical expansion is of paramount importance, hence it 
justifies to further strengthen our search for and developing related analysis 
tools. 
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Figure 5: Output-input graph for the SCAD and nonnegative garotte rules 
with parameters a = 3.7, and Ascad = A garr = 1. Observe that both rules 
smooth out the discontinuity associated with the hard thresholding rule. 
Notice, also, that the SCAD rule removes the bias, associated with the soft 
thresholding rule, for large values of the input variable. On the contrary, 
the garrote thresholding rule allows some bias for large input values, which 
diminishes as A garr gets smaller and smaller. 



Echo cancelation is a major task in Communications. In a number of 
cases, the echo path, represented by a vector comprising the values of the 
impulse response samples, is a sparse one. This is the case, for example, in 

[N^d04 



internet telephony and in acoustic and network environments, e.g. 



Bene 01 Aren 09] , Fig. [6] shows the impulse response of such an echo path 
The impulse response of the echo path is of short duration; however, the 
delay with which it appears is not known. So, in order to model it, one 
has to use a long impulse response, yet only a relatively small number of 
the coefficients will be significant and the rest will be close to zero. Of 
course, one could ask why not use an LMS or an RLS Hayk 96 , Saye 03 and 
eventually the significant coefficients will be identified. The answer is that 
this turns out not to be the most efficient way to tackle such problems, 
since the convergence of the algorithm can be very slow. In contrast, if one 
embeds, somehow, into the problem the a-priori information concerning the 
existence of (almost) zero coefficients, then the convergence speed can be 
significantly increased and also better error floors can be attained. 

A similar situation, as in the previous case, occurs in wireless commu- 
nication systems, which involve multipath channels. A typical application 
is in high definition television (HDTV) systems, where the involved com- 
munications channels consist of a few non-negligible echoes, some of which 
may have quite large time delays with respect to the main signal, see, e.g. 
|Ghos 98 Cott 00 Ariy 97 Rond 03] . If the information signal is transmitted 
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0.2 0.4 0.6 0.8 1 

t [sec] 

Figure 6: The impulse response function of an echo-path in a telephone 
network. Observe that although it is of relatively short duration, it is not 
a-priori known where exactly in time will occur. 



at high symbol rates through such a dispersive channel, then the introduced 
intersymbol interference (ISI) has a span of several tens up to hundreds of 
symbol intervals. This in turn implies that quite long channel estimators are 
required at the receiver's end in order to reduce effectively the ISI component 
of the received signal, although only a small part of it has values substan- 
tially different to zero. The situation is even more demanding whenever the 
channel frequency response exhibits deep nulls. More recently, sparsity has 
been exploited in channel estimation for multicarrier systems, both for single 



antenna as well as for MIMO systems [Eiwe 10a,Eiwe 10b . A thorough and 



in depth treatment related to sparsity in multipath communication systems 



is provided in Bajw 10 



Another example, which might be more widely known, is that of sig- 
nal compression. It turns out that if the signal modalities, with which we 
communicate, e.g., speech, and also we sense the world, e.g., images, audio, 
are transformed into a suitably chosen domain then they are sparsely rep- 
resented; only a relatively small number of the signal components in this 
domain are large and the rest are close to zero. As an example, Fig. [7^, 
shows an image and Fig. [Tja the plot of the magnitude of the obtained Dis- 
crete Cosine Transform (DCT) components, which are computed by writing 
the corresponding image array as a vector in lexicographic order. Note that 
more than 95% of the total energy is contributed by only the 5% of the 
largest components. This is at the heart of any compression technique. 
Only the large coefficients are chosen to be coded and the rest are consid- 
ered to be zero. Hence, significant gains are obtained in memory /bandwidth 
requirements while storing/transmitting such signals, without much percep- 
tual loss. Depending on the modality, different transforms are used. For 
example, in JPEG-2000, an image array, represented in terms of a vector 
that contains the intensity of the gray levels of the image pixels, is trans- 
formed via the discrete wavelet transform (DWT) and results to a transform 
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(a) 

Figure 7: (a) A 512 x 512 pixel image and (b) The magnitude of its Discrete 
Cosine Transform components in descending order and logarithmic scale. 
Note that more than 95% of the total energy is contributed by only the 5% 
of the largest components 



vector that comprises only a few large components. Such an operation is of 
the form 

S = $ H s, s,SeC l , (21) 

where s is the vector of the "raw" signal samples, S the vector of the trans- 
formed ones, and is the I x I transformation matrix. Often, this is an 
orthonormal matrix, = I. Basically, a transform is nothing else than 

a projection of a vector on a new set of coordinate axes, which comprise the 
columns of the transformation matrix Celebrated examples of such trans- 
forms are the wavelet, the discrete Fourier (DFT) and the discrete cosine 
(DCT) transforms, e.g., Theo 09 . In such cases, where the transformation 



matrix is orthonormal, one can write that 



(22) 



where \P = Equation (21) is known as the analysis and (22) as the 
synthesis equation. 

Compression via such transforms exploit the fact that many signals in 
nature, which are rich in context, can be compactly represented in an ap- 
propriately chosen basis, depending on the modality of the signal. Very 
often, the construction of such bases tries to "imitate" the sensory systems 
that the human (and not only) brain has developed in order to sense these 
signals; and we know that nature (in contrast to modern humans) does not 
like to waste resources. A standard compression task comprises the following 
stages: a) Obtain the I components of S, via the analysis step (|2l|), b) keep 
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the, say, k most significant of them, c) code these values, as well as their 
respective locations in the transform vector S, and d) obtain the (approx- 
imate) original signal s, when needed (after storage or transmission), via 
the synthesis equation (22), where in place of S only its k most significant 
components are used, which are the ones that were coded, while the rest are 
set equal to zero. However, there is something unorthodox in this process 
of compression, as it has been practised till very recently. One processes 
(transforms) large signal vectors of I coordinates, where I in practice can be 
quite large, and then uses only a small percentage of the transformed coeffi- 
cients and the rest are simply ignored. Moreover, one has to store/transmit 
the location of the respective large coefficients that were finally coded. A 
natural question that is now raised is the following: Since S in the synthesis 
equation is (approximately) sparse, can one compute it via an alternative 
path than the analysis equation in (21)? The issue here is to investigate 
whether one could use a more informative way of obtaining measurements 
from the available raw data, so that less than I measurements are sufficient 
to recover all the necessary information. The ideal case would be to be 
able to recover it via a set of k such measurement samples, since this is 
the number of the significant free parameters. On the other hand, if this 
sounds a bit extreme, can one obtain N (k < N < I) such signal-related 
measurements, from which one can obtain the k needed components of S? 
It turns out that such an approach is possible and it leads to the solution 
of an underdetermined system of linear equations, under the constraint that 
the unknown target vector is a sparse one. The importance of such tech- 
niques becomes even more apparent when, instead of an orthonormal basis, 
as discussed before, a more general type of expansion is adopted, in terms 
of what is known as overcomplete dictionaries. 

A dictionary Mall 93| is a collection of parameterized waveforms, which 
are discrete-time signal samples, represented as vectors tpi 6 C l , i 6 I. For 
example, the columns of a DFT or a DWT matrix comprise a dictionary. 
These are two examples of what is known as complete dictionaries, which 
consist of / (orthonormal) vectors, i.e., a number equal to the length of the 
signal vector. However, in many cases in practice, using such dictionaries is 
very restrictive. Let us take, for example, a segment of audio signal, from a 
news media or a video, that needs to be processed. This consists, in general, 
of different types of signals, namely speech, music, environmental sounds. 
For each type of these signals, different signal vectors (dictionaries) may be 
more appropriate in the expansion for the analysis. For example, music sig- 
nals are characterized by a strong harmonic content and the use of sinusoids 
seems to be best for compression, while for speech signals a Gabor type 
signal expansion (sinusoids of various frequencies weighted by sufficiently 
narrow pulses at different locations in time, |Coif 92Theo 09] ), may be a 
better choice. The same applies when one deals with an image. Different 
parts of an image, e.g., parts which are smooth or contain sharp edges, may 
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demand a different expansion vector set, for obtaining the best overall per- 
formance. The more recent tendency, in order to satisfy such needs, is to use 
overcomplete dictionaries. Such dictionaries can be obtained, for example, 
by concatenating different dictionaries together, e.g., a DFT and a DWT 
matrix to result in a combined I x 21 transformation matrix. Alternatively, 
a dictionary can be "trained" in order to effectively represent a set of avail- 
able signal exemblars, a task which is often referred to as dictionary learning 



Tosi U Rubi lO Yagh 09 . While using such overcomplete dictionaries, the 



synthesis equation takes the form 

" (23) 



Note that, now, the analysis is an ill-posed problem, since the elements 
{ipi}iex (usually called atoms) of the dictionary are not linearly indepen- 
dent, and there is not a unique set of coefficients {9i}i<=x which generates 
s. Moreover, we expect most of these coefficients to be (nearly) zero. Note 
that, in such cases, the cardinality of I is larger than I. This necessarily 
leads to underdetermined systems of equations with infinite many solutions. 
The question that is now raised is whether we can exploit the fact that most 
of these coefficients are known to be zero, in order to come up with a unique 
solution, and if yes, under which conditions such a solution is possible? 

Besides the previous examples, there is a number of cases where an 
underdetermined system of equations is the result of our inability to obtain 
a sufficiently large number of measurements, due to physical and technical 
constraints. This is for example the case in MRI imaging, which will be 
presented in more detail later on. 



6 In Quest for the Sparsest Solution 



Inspired by the discussion in the previous section, we now turn our attention 
to the task of solving underdetermined systems of equations, by imposing 
the sparsity constraint on the solution Elad 10 . We will develop the the- 



oretical set up in the context of the regression task and we will adopt the 
notation that has been adopted for this task. Moreover, we will adhere 
to the real data case, in order to simplify the presentation. The theory 
can be readily extended to the more general complex data case, see, e.g., 
|Wrig 09b , Male 11 . We assume that we are given a set of measurements, 



V := [yii Vii ■ ■ • i Un] ^ i according to the linear model 



y = X6, y € TZ N , 6 G 1Z l , I > N, 



(24) 



where X is the N x I input matrix, which is assumed to be of full row rank, 
i.e., rank(X) = N . Our starting point is the noiseless case. The system in 



(24) is an underdetermined one and accepts an infinite number of solutions. 
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The set of possible solutions lies in the intersection of the ./V hyperplane^] 
in the /-dimensional space, 

{0 G n l : y n = xlG}, n = l,2,...,N. 

We know from geometry, that the intersection of ./V non-parallel hyperplanes 
(which in our case is guaranteed by the fact that X has been assumed to 
be full row rank, hence x n are mutually independent) is a plane of dimen- 
sionality I — N (e.g., the intersection of two (non-parallel) (hyper )planes in 
the 3-dimensional space is a straight line; that is, a plane of dimensionality 
equal to one). In a more formal way, the set of all possible solutions, to be 
denoted as O, is an affine set. An afhne set is the translation of a linear 
subspace by a constant vector. Let us pursue this a bit further, since we 
will need it later on. 

Let the null space of X be the set null(X), defined as the linear subspace 



null(X) = jz G n l : Xz = o} 



Obviously, if Oq is a solution to (24), i.e., Oq G O, then it is easy to verify 



that WO G 0, X(0 - Oq) = 0, or 6 - G null(X). As a result, 

= o + null(X), 

and is an affine set. We also know from linear algebra basics, that the null 
space of a full row rank matrix, N x I, I > N, is a subspace of dimensionality 
I - N. Fig. [8] illustrates the case for one measurement sample in the 2- 
dimensional space, I = 2 and N = 1. The set of solutions is a line, which 
is the translation of the linear subspace crossing the origin (the null(X)). 
Therefore, if one wants to determine a single point that lies in the affine 
set of solutions, 0, then an extra constraint/a-priori knowledge has to be 
imposed 

In the sequel, three such possibilities are examined. 



6.0.1 The £2 Norm Minimizer 

Our goal now becomes to pick a point in (the affine set) 0, that corre- 
sponds to the minimum £2 norm. This is equivalent to solving the following 
constrained task 

mm "9 

s.t. xl6 = y n: n = l,2,...,N. (25) 

The previous optimization task accepts a unique solution given in closed 
form as 

= X T (XX 7 )' 1 y. (26) 
2 In 1Z L , a hypcrplanc is of dimension I — 1. A plane has dimension lower than I — 1. 
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Figure 8: (a) The £2 norm minimizer. The dotted circle corresponds to 
the smallest £2 ball that intersects the set 0. As such, the intersection 



point, 0, is the I2 norm minimizer of the task (25). Notice that the vector 
6 contains no zero component, (b) The l\ norm minimizer. The dotted 
rhombus corresponds to the smallest £\ ball that intersects 0. Hence, the 
intersection point, 6, is the solution of the constrained i\ minimization task 
of (28). Notice that the obtained estimate 6 = (0, 1) contains a zero. 



The geometric interpretation of this solution is provided in Fig. [8^,, for the 
case of I = 2 and N = 1. The radius of the Euclidean norm ball keeps 
increasing, till it touches the plane that contains the solutions. This point 
is the one with the minimum £2 norm or, equivalently, the point that lies 
closest to the origin. Equivalently, the point can be seen as the (metric) 
projection of onto 0. 

Minimizing the £2 norm, in order to solve a linear set of underdetermined 
equations, has been used in various applications. The closest to us is in the 
context of determining the unknown coefficients in an expansion using an 
overcomplete dictionary of functions (vectors) |Daub 88 . A main drawback 



of this method is that it is not sparsity preserving. There is no guarantee that 



the solution in (26) will give zeros even if the true model vector 6 has zeros. 



Moreover, the method is resolution limited |Chen 98 . This means that, even 



if there may be a sharp contribution of specific atoms in the dictionary, this 
is not portrayed in the obtained solution. This is a consequence of the fact 
that the information provided by XX T is a global one, containing all atoms 
of the dictionary in an "averaging" fashion, and the final result tends to 
smooth out the individual contributions, especially when the dictionary is 
overcomplete. 

6.0.2 The £ Norm Minimizer 

Now we turn our attention to the £q norm (once more, it is pointed out that 
this is an abuse of the definition of the norm, as stated before), and we make 
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sparsity our new flag under which a solution will be obtained. Recall from 
Section [5] that such a constraint is in line with the natural structure that 
underlies a number of applications. The task now becomes 

min ||0||n 

s.t. xl6 = y n , n=l,2,...,N, (27) 

that is, from all the points that lie on the plane of all possible solutions find 
the sparsest one; i.e., the one with the least number of nonzero elements. 
As a matter of fact, such an approach is within the spirit of Occam's razor 
rule. It corresponds to the smallest number of parameters that can explain 
the obtained measurements. The points that are now raised are: 

• Is a solution to this problem unique and under which conditions? 

• Can a solution be obtained with low enough complexity in realistic 
time? 

We postpone the answer to the first question later on. As for the second one, 
the news is no good. Minimizing the £q norm under a set of linear constraints 
is a task of combinatorial nature and as a matter of fact the problem is, in 



general, NP-hard |Nata 95 . The way to approach the problem is to consider 



all possible combinations of zeros in 0, removing the respective columns of 



X in (24) and check whether the system of equations is satisfied; keep as 
solutions the ones with the smallest number of nonzero elements. Such a 
searching technique exhibits complexity of an exponential dependence on I. 
Fig. [8^i illustrates the two points ((1.5, 0) and (0, 1)) that comprise the solu- 
tion set of minimizing the £q norm for the single measurement (constraint) 
case. 



6.0.3 The £\ Norm Minimizer 

The current task is now given by 

min H^IL 

s.t. xlO = y n , n = l,2,...,N. (28) 

Fig. [8)3 illustrates the geometry. The £\ ball is increased till it touches the 
affine set of the possible solutions. For this specific geometry, the solution 
is the point (0, 1). In our discussion in Section [3j we saw that the l\ norm 
is the one, out of all £ p , p > 1 norms, that bears some similarity with the 
sparsity favoring (nonconvex) £ p , p < 1 "norms". Also, we have commented 
that the £\ norm encourages zeros, when the respective values are small. 
In the sequel, we will state one lemma, that establishes this zero-favoring 
property in a more formal way. The l\ norm minimizer is also known as 
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Basis Pursuit and it was suggested for decomposing a vector signal in terms 



of the atoms of an overcomplete dictionary Chen 98 



The l\ minimizer can be brought into the standard Linear Programming 
(LP) form and then can be solved by recalling any related method; the sim- 
plex method or the more recent interior point methods are two possibilities, 
see, e.g., |Boyd 04 , Pant 63 . Indeed, consider the (LP) task 



mm 

X 

S.t. 



T 
C X 



Ax = b 
x > 0. 



To verify that our i\ minimizer can be cast in the previous form, notice first 
that any /-dimensional vector 6 can be decomposed as 

6 = u — v, u > 0, v > 0. 

Indeed, this holds true if, for example, 

u := 6 + , v := (-0) + , 

where x+ stands for the vector obtained after taking the positive parts of 
the components of x. Moreover, notice that 



\e\ 



[M,..., i] 



-0) 



[1,1,..., 1] 



Hence, our l\ minimization task can be recast in the LP form, if 
c:= [1,1,..., 1] T , x:=[u T ,v T ] T , 



A:= [X,-X], 



y- 



6.0.4 Characterization of the l\ norm minimizer 

Lemma 1. An element 6 in the affine set, O, of the solutions of the un- 



derdetermined linear system ( 24 ) , has minimal i\ norm if and only if the 
following condition is satisfied: 



^2 sgn ( 



< 



E 

i: 6»i=0 



Mz G null(X). 



(29) 



Moreover, the t\ minimizer is unique if and only i/the inequality in (29) is 



a strict one for all z ^ (see, e.g., Pink 89 



Remarks 2. The previous lemma has a very interesting and important 
consequence. If 6 is the unique minimizer of (28), then 

card{i : 0« = 0} > dim(null(X)), (30) 
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where card{-} denotes the cardinality of a set. In words, the number of zero 
coordinates of the unique minimizer cannot be smaller than the dimension of 
the null space of X. Indeed, if this is not the case, then the unique minimizer 
could have less zeros than the dimensionality of null(X). As it can easily be 
shown, this means that we can always find az£ null(Jf), which has zeros 
in the same locations where the coordinates of the unique minimizer are 
zero, and at the same time it is not identically zero, i.e., 2 / 0. However, 



this would violate (29), which in the case of uniqueness holds as a strict 
inequality. 

Definition 1. A vector 6 is called /c-sparse if it has at most k nonzero 
components. 



Remarks 3. If the minimizer of (28) is unique, then it is a /c-sparse vector 
with 

k < N. 

This is a direct consequence of the Remark [2j and the fact that for the 
matrix X, 

dim(null(X)) = I - rank(X) = I - N. 

Hence, the number of the nonzero elements of the unique minimizer must 
be at most equal to N. 

If one resorts to geometry, all the previously stated results become crystal 
clear. 



6.0.5 Geometric interpretation 

Assume that our target solution resides in the 3-dimensional space and that 
we are given one measurement 

Vl = xJO = 27X101 + Xl 2 02 + Xl 3 6 3 . 

Then the solution lies in the 2-dimensional (hyper)plane, which is described 
by the previous equation. To get the minimal l\ solution we keep increasing 
the size of the l\ balj^j (the set of all points that have equal i\ norm) till it 
touches this plane. The only way that these two geometric objects have a 
single point in common (unique solution) is when they meet at a corner of the 
diamond. This is shown in Fig. [9^. In other words, the resulting solution is 
1-sparse, having two of its components equal to zero. This complies with the 
finding stated in Remark [3j since now N = 1. For any other orientation of 
the plane, this will either cut across the l\ ball or will share with the diamond 
an edge or a side. In both cases, there will be infinite many solutions. 
Let us now assume that we are given an extra measurement, 

2/2 = X2101 + X22O2 + £23#3- 
3 Observe that in the 3-dimensional space the l\ ball looks like a diamond. 
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(a) (b) 

Figure 9: (a) The l\ ball intersecting with a plane. The only possible 
scenario, for the existence of a unique common intersecting point of the l\ 
ball with a plane in the Euclidean 1Z 3 space, is for the point to be located 
at one of the corners of the t\ ball, i.e., to be an 1-sparse vector, (b) The 
t\ ball intersecting with lines. In this case, the sparsity level of the unique 
intersecting point is relaxed; it could be an 1- or a 2-sparse vector. 



The solution now lies in the intersection of the two previous planes, which 
is a straight line. However, now, we have more alternatives for a unique 
solution. A line, e.g., 0i, can either touch the t\ ball at a corner (1-sparse 
solution) or, as it is shown in Fig. [9)3, it can touch the l\ ball at one of its 
edges, e.g., 62- The latter case, corresponds to a solution that lies on a 2- 
dimensional subspace, hence it will be a 2-sparse vector. This also complies 
with the findings stated in Remark [3] since in this case, we have N = 2, 
I = 3 and the sparsity level for a unique solution can be either 1 or 2. 

Note that uniqueness is associated with the particular geometry and 
orientation of the affine set, which is the set of all possible solutions of the 
underdetermined system of equations. For the case of the square £2 norm, 
the solution was always unique. This is a consequence of the (hyper)spherical 
shape formed by the Euclidean norm. From a mathematical point of view, 
the square I2 norm is a strict convex function. This is not the case for the 
i\ norm, which is convex, albeit not a strict convex function. 

Example 2. Consider a sparse vector parameter [0, which we assume 
to be unknown. We will use one measurement to sense it. Based on this 



single measurement, we will use the l\ minimizer of (28) to recover its true 
value. Let us see what happens. We will consider three different values of the 
"sensing" (input) vector x in order to obtain the measurement y = x T 0: a) 
x = 1] T , b) x = [1, 1] T , and c) x = [2, 1] T . The resulting measurement, 
after sensing 6 by x, is y = 1 for all the three previous cases. 
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Case a): The solution will lie on the straight line 

@ = l[e 1 ,e 2 f en 2 ^e x + e 2 = i). 



which is shown in Fig. 10 1. For this setting, expanding the l\ ball, this 
will touch the line (our solutions' affine set) at the corner [0,1] T . This is a 
unique solution, hence it is sparse, and it coincides with the true value. 
Case b): The solutions lies on the straight line 

e = {[e 1 ,d 2 f gtz 2 ■.e 1 + e 2 = i}, 



which is shown in Fig. [TUp . For this set up, there is an infinite number of 
solutions, including two sparse ones. 

Case c): The affine set of solutions is described by 

e = {[9 1 ,8 2 f eTZ 2 :29 l + 6 2 = l}, 



which is sketched in Fig. [TOp . The solution in this case is sparse, but it is 
not the correct one. 

This example is quite informative. If we sense (measure) our unknown 
parameter vector with appropriate sensing (input) data, the use of the l\ 
norm can unveil the true value of the parameter vector, even if the system 
of equations is under determined, provided that the true parameter is sparse. 
This now becomes our new goal. To investigate whether what we have 
just said can be generalized, and under which conditions holds true, if it 
does. In such a case, the choice of the regressors (which we just called them 
sensing vectors) and hence the input matrix (which, from now on, we will 
refer to, more and more frequently, as the sensing matrix) acquire an extra 
significance. It is not enough for the designer to care only for the rank of 
the matrix, i.e., the linear independence of the sensing vectors. One has 
to make sure that the corresponding affine set of the solutions has such an 
orientation, so that the touch with the t\ ball, as this increases from zero 
to meet this plane, is a "gentle" one, i.e., they meet at a single point, and 
more important at the correct one; that is, at the point that represents the 
true value of the sparse parameter, which we are searching for. 



Remarks 4. 

• Often in practice, the columns of the input matrix, X, are normalized 
to unit l 2 norm. Although £q norm is insensitive to the values of the 
nonzero components of 6, this is not the case with the i\ and l 2 norms. 
Hence, while trying to minimize the respective norms, and at the same 
time to fulfill the constraints, components that correspond to columns 
of X with high energy (norm) are favored more than the rest. Hence, 
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(c) 

Figure 10: (a) Sensing with x = [2,l] r , (b) sensing with x = [1,1] T , (c) 
sensing with x = [2, 1] T . The choice of the sensing vector x is crucial to 
unveiling the true sparse solution (0, 1). Only the sensing vector x = [~, 1] T 
identifies uniquely the desired (0, 1). 

the latter become more popular candidates to be pushed to zero. In 
order to avoid such situations, the columns of X are normalized to 
unity, by dividing each element of the column vector by the respective 
(Euclidean) norm. 

7 Uniqueness of the £q Minimizer 

Our first goal is to derive sufficient conditions that guarantee uniqueness of 
the do minimizer, which has been defined in Section [6| 

Definition 2. The spark of a full rank N X I (I > N) matrix, X , denoted 
as spark(X), is the smallest number of its linearly dependent columns. 

According to the previous definition, any m < spark(X) columns of X 
are, necessarily, linearly independent. The spark of a square, N x N, full 
rank matrix is equal to iV + 1. 

Remarks 5. 

• In contrast to the rank of a matrix, which can be easily determined, 
its spark can only be obtained by resorting to a combinatorial search 
over all possible combinations of the columns of the respective matrix, 
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see, e.g., [Bruc 09 Dono 03 . The notion of the spark was used in 
the context of sparse representation, under the name of Uniqueness 
Representation Property, in |Goro 97 . The name "spark" was coined 
An interesting discussion relating this matrix index with 



in I Dono 03 



other indices, used in other disciplines, is given in Br uc 09| . 
Example 3. Consider the following matrix 



X 



1 1 

10 11 

1 1 

1 



The matrix has rank equal to 4 and spark equal to 3. Indeed, any pair of 
columns are linearly independent. On the other hand, the first, the second 
and the fifth columns are linearly dependent. The same is also true for the 
combination of the second, third and sixth columns. 

Lemma 2. If null(X) is the null space of X, then 

||0|| o > spark(X), V0 € null(X), 0/0. 

Proof: To derive a contradiction, assume that there exists a / £ null(X) 
such that ||0|| o < spark(X). Since by definition X6 = 0, there exists a 
number of ||0|| o columns of X that are linearly dependent. However, this 
contradicts the minimality of spark(X), and the claim of Lemma [2] is estab- 
lished. 

Lemma 3. If a linear system of equations, X6 = y, has a solution that 
satisfies 

l|0|lo < 2 s P ark P0' 

then this is the sparsest possible solution. In other words, this is, necessarily, 
the unique solution of the lo minimizer. 

Proof: Consider any other solution h / 0. Then, — h € null(X), i.e., 

X(6 -h) = 0. 

Thus, according to Lemma [2j 



spark(X) < \\0 



h\\ < 



|0|lo + 



h\\ 



(31) 



Observe that although the £q "norm" is not a true norm, it can be readily 
verified by simple inspection and reasoning that the triangular property 
is satisfied. Indeed, by adding two vectors together, the resulting number 
of nonzero elements will always be at most equal to the total number of 
nonzero elements of the two vectors. Therefore, if ||0|| o < \ spark(X), then 
(31 ) suggests that 

ll^llo > 2 s P ark (^) > ||0|lo- 
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Remarks 6. 

• Lemma [3] is a very interesting result. We have a sufficient condition to 
check whether a solution is the unique optimal in a, generally, NP-hard 
problem. Of course, although this is nice from a theoretical point of 
view, is not of much use by itself, since the related bound (the spark) 
can only be obtained after a combinatorial search. Well, in the next 
section, we will see that we can relax the bound by involving another 
index, in place of the spark, which can be easily computed. 

• An obvious consequence of the previous lemma is that if the unknown 
parameter vector is a sparse one with k nonzero elements, then if 
matrix X is chosen so that to have spark(X) > 2k, then the true 
parameter vector is necessarily the sparsest one that satisfies the set 
of equations, and the (unique) solution to the 0.$ minimizer. 

• In practice, the goal is to sense the unknown parameter vector by a 
matrix that has as high a spark as possible, so that the previously 
stated sufficiency condition to cover a wide range of cases. For exam- 
ple, if the spark of the input matrix is, say, equal to three, then one 
can check for optimal sparse solutions up to a sparsity level of k = 1. 
From the respective definition, it is easily seen that the values of the 
spark are in the range 1 < spark(X) < N + 1. 

• Constructing an N x I matrix X in a random manner, by generating 
i.i.d entries, guarantees, with high probability, that spark(X) = N + l; 
that is, any N columns of the matrix are linearly independent. 



7.1 Mutual Coherence 

Since the spark of a matrix is a number that is difficult to compute, our 
interest shifts to another index, which can be derived easier and at the same 
time can offer a useful bound on the spark. The mutual coherence of an 



N x / matrix X Mall 93 , denoted as fi(X), is defined as 



H(X) := max „ 1 i n jl in (32) 

l<«<i<' ||Xj|| ||Xj|| 

where Xj, i = 1,2, ... ,1, denote the columns of X (notice the difference in 
notation between a row xj and a column Xj of the matrix X). This num- 
ber reminds us of the correlation coefficient between two random variables. 
Mutual coherence is bounded as < n{X) < 1. For a square orthogonal 
matrix, X, fi(X) = 0. For general matrices, with I > N, n(X) satisfies 



' l — N 
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which is known as the Welch bound [Welc 74 . For large values of I, the lower 
bound becomes, approximately, [t(X) > Common sense reasoning 

guides us to construct input (sensing) matrices of mutual coherence as small 
as possible. Indeed, the purpose of the sensing matrix is to "measure" 
the components of the unknown vector and "store" this information in the 
measurement vector y. Thus, this should be done in such a way so that y to 
retain as much information about the components of as possible. This can 
be achieved if the columns of the sensing matrix, X, are as "independent" 
as possible. Indeed, y is the result of a combination of the columns of X, 
each one weighted by a different component of 0. Thus, if the columns 
are as much "independent" as possible then the information regarding each 
component of is contributed by a different direction making its recovery 
easier. This is easier understood if X is a square orthogonal matrix. In the 
more general case of a non-square matrix, the columns should be made as 
"orthogonal" as possible. 

Example 4. Assume that X is an ./V x 2N matrix, formed by concatenating 
two orthonormal bases together, 

X = [I,W], 

where I is the identity matrix, having as columns the vectors ej, % = 
1, 2, . . . , N, with elements equal to 




for r = 1, 2, . 

as 



. , N. The matrix W is the orthonormal DFT matrix, defined 



W 



1 1 

1 W N 



1 

N-l 



N 



i w"- 1 ... wp-w-v 



where 



W N := exp - j 



.2tt 



Such an overcomplete dictionary could be used to represent signal vectors in 



terms of the expansion in (23), that comprise the sum of sinusoids with very 



narrow spiky-like pulses. The inner products between any two columns of / 
and between any two columns of W are zero, due to orthogonality. On the 
other hand, it is easy to see that the inner product between any column of 



/ and any column of W has absolute value equal to 



/N 



Hence, the mutual 



coherence of this matrix is fi(X) = 
of this matrix is spark(X) = N + 1. 



/AT 



Moreover, observe that the spark 
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Lemma 4. For any N x I matrix X, the following inequality holds 

sparkpO > 1 + (33) 



The proof is given in Dono 03 and it is based on arguments that stem 
from matrix theory applied on the Gram matrix, X T X, of X. A "superficial" 
look at the previous bound is that for very small values of n(X) the spark 
can be larger than N + 1\ Looking at the proof, it is seen that in such cases 
the spark of the matrix attains its maximum value N + 1. 

The result complies with a common sense reasoning. The smaller the 
value of n(X) the more independent are the columns of X, hence the higher 
the value of its spark is expected to be. Based on this lemma, we can now 
state the following theorem, first given in [Dono 03"] . Combining the way that 
Lemma [3] is proved and (33), we come to the following important theorem. 



Theorem 1. If the linear system of equations in (24) has a solution that 
satisfies the condition 



- (l — 

2\ 1 + KX) 

then this solution is the sparsest one. 
Remarks 7. 



Alio < 7i ( 1 + ) > (34) 



The bound in (34) is "psychologically" important. It relates an easily 
computed bound to check whether the solution to a NP-hard task is 
the optimal one. However, it is not a particularly good bound and it 
restricts the range of values in which it can be applied. As we saw in 
Example |4j while the maximum possible value of the spark of a matrix 
was equal to N + 1, the minimum possible value of the mutual coher- 
ence was -j=. Therefore, the bound based on the mutual coherence 
restricts the range of sparsity, i.e., [|0[|q, where one can check optimal- 
ly, to around ^y/~N. Moreover, as the previously stated Welch bound 
suggests, this 0(^=) dependence of the mutual coherence seems to 
be a more general trend and not only the case for Example |4j see, 
e.g., |Dono 01 . On the other hand, as we have already stated in the 



Remarks [6] , one can construct random matrices with spark equal to 
N + 1; hence, using the bound based on the spark, one could expand 
the range of sparse vectors up to hN. 



8 Equivalence of £q and t\ Minimizers: Sufficiency 
Conditions 

We have now come to the crucial point and we will establish the conditions 
that guarantee the equivalence between the l\ and the £q minimizers. Hence, 



8.1 Condition Implied by the Mutual Coherence Number 



30 



under such conditions, a problem, that is in general NP-hard problem, can 
be solved via a tractable convex optimization task. Under these conditions, 
the zero value encouraging nature of the l\ norm, that has already been 
discussed, obtains a much higher stature; it provides the sparsest solution. 



8.1 Condition Implied by the Mutual Coherence Number 
Theorem 2. Let the underdetermined system of equations 

y = xe, 

where X is an N x / [N < I) full row rank matrix. If a solution exists and 
satisfies the condition 



If 1 



|0Ho< oU + TT^h ( 35 ) 



then this is the unique solution of both, the Iq as well the l\ minimizers. 

This is a very important theorem and it was shown independently in 
|Dono 03 , Grib 03 . Earlier versions of the theorem addressed the special 



case of a dictionary comprising two orthonormal bases, |Dono OlElad 02 . 



A proof is also summarized in [Bruc 09 . This theorem established, for a 
first time, what it was till then empirically known: often, the l\ and £q 
minimizers result in the same solution. 

Remarks 8. 

• The theory that we have presented so far is very satisfying, since it 
offers the theoretical framework and conditions that guarantee unique- 
ness of a sparse solution to an underdetermined system of equations. 
Now we know that, under certain conditions, the solution, which we 
obtain by solving the convex l\ minimization task, is the (unique) 
sparsest one. However, from a practical point of view, the theory, 
which is based on mutual coherence, does not say the whole story 
and falls short to predict what happens in practice. Experimental evi- 
dence suggests that the range of sparsity levels, for which the £q and l\ 
tasks give the same solution, is much wider than the range guaranteed 
by the mutual coherence bound. Hence, there is a lot of theoretical 
happening in order to improve this bound. A detailed discussion is 
beyond the scope of this paper. In the sequel, we will present one of 
these bounds, since it is the one that currently dominates the scene. 
For more details and a related discussion the interested reader may 



consult, e.g., Dono 10b 
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8.2 The Restricted Isometry Property (RIP) 

Definition 3. For each integer k = 1, 2, . . ., define the isometry constant 5k 
of an N x / matrix X as the smallest number such that 

(i - h) \\e\\l < \\X0\\1 < (i + s k ) \\e\\l , (36) 

holds true for all £;-sparse vectors 9. 



This definition was introduced in |Cand 05b . We loosely say that matrix 



X obeys the RIP of order k if 5k is not too close to one. When this prop- 
erty holds true, it implies that the Euclidean norm of 9 is approximately 
preserved, after projecting it on the rows of X. Obviously, if matrix X were 
orthonormal then 5k = 0. Of course, since we are dealing with non-square 
matrices this is not possible. However, the closer 5k is to zero, the closer 
to orthonormal all subsets of k columns of X are. Another view point of 



(36) is that it preserves Euclidean distances between /c-sparse vectors. Let 



us consider two /c-sparse vectors, 0\, 9 2 and apply (36) to their difference 



9\ — 92, which, in general, is a 2/c-sparse vector. Then we obtain 

(1 - 5 2k ) \\0i - 9 2 \\ 2 2 < \\X(9 1 - 9 2 )\\l < (1 + 5 2k ) - 9 2 f 2 . (37) 

Thus, when 5 2 k is small enough, the Euclidean distance is preserved after 
projection in the lower dimensional measurements' space. In words, if the 
RIP holds true, this means that searching for a sparse vector in the lower 
dimensional subspace formed by the measurements, 1Z N , and not in the 
original /-dimensional space, one can still recover the vector since distances 
are preserved and the target vector is not "confused" with others. After 
projection on the rows of X, the discriminatory power of the method is 
retained. It is interesting to point out that the RIP is also related to the 
condition number of the Grammian matrix. In Cand 05b Bara 08 , it is 



pointed out that if X r denotes the matrix that results by considering only 



r of the columns of X, then the RIP in (36) is equivalent with requiring 
the respective Grammian, X^X r , r < k, to have its eigenvalues within the 
interval [1 — 5k, 1 + 5k] ■ Hence, the more well conditioned the matrix is, the 
better is for us to dig out the information hidden in the lower dimensional 
measurements space. 

Theorem 3. Assume that for some k, 5 2 k < V% — 1. Then the solution to 



the l\ minimizer of ( 28 ) , denoted as 9* , satisfies the following two conditions 

\\9 — 9*\\ 1 < Co \\9 — 9k\\i , (38) 

and 

\\9-94 2 <C Q k--2\\9-9 k \\ l , (39) 
for some constant Cq. In the previously stated formulas, 9 is the true (target) 



vector that generates the measurements in (28) and 9k is the vector that 
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results from 6 if we keep its k largest components and set the rest equal to 
zero, 



Cand 05b,Cand 06c,Cand 08a, Cand 05a 



Hence, if the true vector is a sparse one, i.e., 6 = Ok, then the £\ min- 
imizer recovers the (unique) exact value. On the other hand, if the true 
vector is not a sparse one, then the minimizer results in a solution whose 
accuracy is dictated by a genie-aided procedure that knew in advance the 
locations of the k largest components of 6. This is a groundbreaking result. 
Moreover, it is deterministic, it is always true and not with high probability. 
Note that the isometry property of order 2k is used, since at the heart of 
the method lies our desire to preserve the norm of the differences between 
vectors. 

Let us now focus on the case where there is a £;-sparse vector that gener- 



ates the measurements, i.e., 6 = 0^. Then it is shown in Cand 05a| that the 
condition 82k < 1 guarantees that the £q minimizer has a unique fc-sparse 
solution. In other words, in order to get the equivalence between the £\ and 
£0 minimizers, the range of values for 82k has to be decreased to 82k < V2 — 1, 
according to Theorem [3j This sounds reasonable. If we relax the criterion 
and use £\ instead of £q, then the sensing matrix has to be more carefully 
constructed. Although we are not going to provide the proofs of these the- 
orems here, since their formulation is well beyond the scope of this paper, 
it is interesting to follow what happens if 82k = 1- This will give us a flavor 



of the essence behind the proofs. If 82k = 1 5 the left hand side term in (37) 
becomes zero. In this case, there may exist two /c-sparse vectors 0±, O2 such 
that X(0\ — O2) = 0, or X0\ = XO2. Thus, it is not possible to recover all 
/c-sparse vectors, after projecting them in the measurements space, by any 
method. 

The previous argument also establishes a connection between RIP and 
the spark of a matrix. Indeed, if 82k < 1, this guarantees that any number 
of columns of X up to 2k are linearly independent, since for any 2/c-sparse 



6, (36) guarantees that ||X{?|| 2 > 0. This implies that spark(X) > 2k. A 



connection between RIP and the coherence is established in [Cai 09b| , where 
it is shown that if X has coherence n(X), and unit norm columns, then X 
satisfies the RIP of order k with 8k, where 8k < (k — l)fi(X). 



8.2.1 Constructing Matrices that Obey the RIP of order k 

It is apparent from our previous discussion, that the higher the value of k, 
for which the RIP property of a matrix, X, holds true, the better, since a 
larger range of sparsity levels can be handled. Hence, a main goal towards 
this direction is to construct such matrices. It turns out that verifying the 
RIP for a matrix of a general structure is a difficult task. This reminds us of 
the spark of the matrix, which is also a difficult task to compute. However, it 
turns out that for a certain class of random matrices, the RIP follows fairly 
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easy. Thus, constructing such sensing matrices has dominated the scene of 
related research. We will present a few examples of such matrices, which are 
also very popular in practice, without going into details of the proofs, since 
this is out of our scope and the interested reader may dig this information 
from the related references. 

Perhaps, the most well known example of a random matrix is the Gaus- 
sian one, where the entries X(i,j) of the sensing matrix are i.i.d. realizations 
from a Gaussian pdf Af(0, i). Another popular example of such matrices is 
constructed by sampling i.i.d. entries from a Bernoulli, or related, distribu- 
tions 



1 



N' 



or 



X(i,j) 



N 



with probability 
with probability 



1 

2' 
1 

2' 



with probability -, 
6 



0, with probability 



with probability 



3' 
1 

6' 



Finally, one can adopt the uniform distribution and construct the columns 
of X by sampling uniformly at random on the unit sphere in 1Z N . It turns 
out, that such matrices obey the RIP of order k, with overwhelming proba- 
bility, provided that the number of measurements, N, satisfy the following 
inequality 

N>Ck\n(l/k), (40) 

where C is some constant, which depends on the isometry constant 5k- In 
words, having such a matrix at our disposal, one can recover a fc-sparse 
vector from N < I measurements, where N is larger than the sparsity level 
by an amount controlled by the inequality (40). More on these issues can 
be obtained from, e.g., Bara 08 Mend 08]. 



Besides random matrices, one can construct other matrices that obey 
the RIP. One such example includes the partial Fourier matrices, which are 
formed by selecting uniformly at random N rows drawn from the I x I DFT 
matrix. Although the required number of samples for the RIP to be satisfied 
may be larger than the bound in (40) (see, |Rude 08] ) , Fourier-based sensing 
matrices offer certain computational advantages, when it comes to storage 



(O(iVlnZ)) and matrix-vector products (C(/lnZ)), |Cand 06a . In Haup 10 



the case of random Toeplitz sensing matrices, containing statistical depen- 
dencies across rows, is considered and it is shown that they can also satisfy 
the RIP with high probability. This is of particular importance in signal 
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processing and communications applications, where it is very common for 
a system to be excited in its input via a time series, hence independence 
between successive input rows cannot be assumed. In [Rive 09 Duar 12 



the case of separable matrices is considered where the sensing matrix is the 
result of a Kronecker product of matrices, which satisfy the RIP individu- 
ally. Such matrices are of interest for multidimensional signals, in order to 
exploit the sparsity structure along each one of the involved dimensions. For 
example, such signals may occur while trying to "encode" information asso- 
ciated with an event whose activity spreads across the temporal, spectral, 
spatial, etc., domains. 

In spite of their theoretical elegance, the derived bounds, that determine 
the number of the required measurements for certain sparsity levels, fall 
short of what is the experimental evidence, e.g., Dono 10b . In practice, a 



rule of thumb is to use ./V of the order of Sk-5k, e.g., [Cand 05a . For large 
values of /, compared to the sparsity level, the analysis in Dono 06| suggests 
that we can recover most sparse signals when N ~ 2kln(l/N). In an effort 
to overcome the shortcomings associated with the RIP, a number of other 
techniques have been proposed, e.g. Cohe 09 Bick 09 Tang 11 , Dono 10b|. 



Furthermore, in specific applications, the use of an empirical study may be 
a more appropriate path. 

Note that, in principle, the minimum number of measurements that are 
required to recover a k sparse vector from N < I measurements is N > 2k. 
Indeed, in the spirit of the discussion after Theorem|3j the main requirement 
that a sensing matrix must fulfil is the following: not to map two different k- 
sparse vectors to the same measurement vector y. Otherwise, one can never 
recover both vectors from their (common) measurements. If we have 2k 
measurements and a sensing matrix that guarantees that any 2k columns are 
linearly independent, then the previously stated requirement is readily seen 
that it is satisfied. However, the bounds on the number of measurements set 
in order the respective matrices to satisfy the RIP are higher. This is because 
RIP accounts also for the stability of the recovery process. We will come to 
this issue soon, in Section 10, where we talk about stable embeddings. 



9 Robust Sparse Signal Recovery from Noisy Mea- 
surements 

In the previous section, our focus was on recovering a sparse solution from 
an underdetermined system of equations. In the formulation of the problem, 
we assumed that there is no noise in the obtained measurements. Having 
acquired a lot of experience and insight from a simpler problem, we now 
turn our attention to the more realistic task, where uncertainties come into 
the scene. One type of uncertainty may be due to the presence of noise and 
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our measurements' model comes back to the standard regression form 



y = XO + r l , (41) 
where X is our familiar non-square ./V x I matrix. A sparsity-aware formu- 



lation for recovering 6 from (41) can be cast as 



mm 
s.t. 



\e\ 



\y 



xe\\ z 2 < e, 



(42) 



which coincides with the LASSO task given in (15). Such a formulation 



implicitly assumes that the noise is bounded and the respective range of 
values is controlled by e. One can consider a number of different variants. 
For example, one possibility would be to minimize the ||-|| norm instead 



of the 



U' 



albeit loosing the computational elegance of the latter. An 



alternative route would be to replace the Euclidean norm in the constraints 
with another one. 

Besides the presence of noise, one could see the previous formulation 
from a different perspective. The unknown parameter vector, 0, may not 
be exactly sparse, but it may consist of a few large components, while the 
rest are small and close to, yet not necessarily equal to, zero. Such a model 
misfit can be accommodated by allowing a deviation of y from X6. 

In this relaxed setting of a sparse solution recovery, the notions of unique- 
ness and equivalence, concerning the 1$ and l\ solutions, no longer apply. 
Instead, the issue that now gains in importance is that of stability of the 
solution. To this end, we focus on the computationally attractive l\ task. 
The counterpart of Theorem [3] is now expressed as follows. 

Theorem 4. Assume that the sensing matrix, X, obeys the RIP with 82k < 
V2-1, for some k. Then the solution 0* of (|42|) satisfies the following 
( [Cand 06c[|Cand 08a| ), 



0-0*|| 2 < C k-2 110-0*11! + Civ^, 



(43) 



for some constants C\, Cq. 



This is also an elegant result. If the model is exact and e = we obtain 
(39). If not, the higher the uncertainty (noise) term in the model, the higher 



our ambiguity about the solution. Note, also, that the ambiguity about the 
solution depends on how far the true model is from 0^. If the true model is 
A;-sparse, the first term on the right hand side of the inequality is zero. The 
values of C\,Cq depend on 82k but they are small, e.g., close to five or six, 
|Cand 08a 



The important conclusion, here, is that the LASSO formulation for solv- 
ing inverse problems (which in general tend to be ill-conditioned) is a stable 
one and the noise is not amplified excessively during the recovery process. 
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10 Compressed Sensing: The Glory of Random- 



The way in which this paper was deplored followed, more or less, the se- 
quence of developments that took place during the evolution of the sparsity- 
aware parameter estimation field. We intentionally made an effort to follow 
such a path, since this is also indicative of how science evolves in most cases. 
The starting point had a rather strong mathematical flavour: to develop con- 
ditions for the solution of an underdetermined linear system of equations, 
under the sparsity constraint and in a mathematically tractable way, i.e., 
using convex optimization. In the end, the accumulation of a sequence of in- 
dividual contributions revealed that the solution can be (uniquely) recovered 
if the unknown quantity is sensed via randomly chosen data samples. This 
development has, in turn, given birth to a new field with strong theoretical 
interest as well as with an enormous impact on practical applications. This 
new emerged area is known as compressed sensing or compressive sampling 
(CS). Although CS builds around the LASSO and Basis Pursuit (and vari- 
ants of them, as we will soon see), it has changed our view on how to sense 
and process signals efficiently. 

10.0.2 Compressed Sensing 

In compressed sensing, the goal is to directly acquire as few samples as 
possible that encode the minimum information, which is needed to obtain 
a compressed signal representation. In order to demonstrate this, let us 
return to the data compression example, which was discussed in Section [5j 
There, it was commented that the "classical" approach to compression was 
rather unorthodox, in the sense that first all (i.e., a number of /) samples of 
the signal are used, then they are processed to obtain I transformed values, 
from which only a small subset is used for coding. In the CS setting, the 
procedure changes to the following one. 

Let X be an N x I sensing matrix, which is applied to the (unknown) 
signal vector, s, in order to obtain the measurements, y, and ^ be the 
dictionary matrix that describes the domain where the signal s accepts a 
sparse representation, i.e., 



Assuming that at most k of the components of 6 are nonzero, this can be 
obtained by the following optimization task 



ness 



s = ®0 



y = Xs. 



(44) 



eeii 1 
s.t. 



mm 




i 



y = x^e, 



(45) 
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provided that the combined matrix A\I/ complies with the RIP and the num- 



ber of measurements, N, satisfies the associated bound given in (40). Note 
that s needs not to be stored and can be obtained any time, once is known. 
Moreover, as we will soon discuss, the measurements, y n , n = 1, 2, . . . , N, 
can be acquired directly from an analogue signal s(t), prior to obtaining its 
sample (vector) version, s! Thus, from such a perspective, CS fuses the data 
acquisition and the compression steps together. 

There are different ways to obtain a sensing matrix, X, that leads to a 
product X*$>, which satisfies the RIP. It can be shown, that if \& is orthonor- 
mal and X is a random matrix, which is constructed as discussed at the 



end of Section 8.2, then the product A^ obeys the RIP, provided that (40) 
is satisfied, |Cand 08a| . An alternative way to obtain a combined matrix, 
that respects the RIP, is to consider another orthonormal matrix <E>, whose 
columns have low coherence with the columns of (coherence between two 



matrices is defined in (32), where, now, the pace of Xj is taken by a column 
of $ and that of Xj by a column of ^). For example, <3? could be the DFT 
matrix and ^ = I or vice versa. Then choose N rows of $ uniformly at 



random to form X in (44). In other words, for such a case, the sensing 
matrix can be written as R<&, where R is a N x I matrix that extracts N co- 
ordinates uniformly at random. The notion of incoherence (low coherence) 
between the sensing and the basis matrices is closely related to RIP. The 
more incoherent the two matrices are, the less the number of the required 



measurements for the RIP to hold, e.g., Cand 06b Rude 08 . Another way 
to view incoherence is that the rows of <1> cannot be sparsely represented in 
terms of the columns of It turns out that if the sensing matrix A is a 
random one, formed as it has already been described in Section [8. 2. 1[ then 
RIP and the incoherence with any \I' are satisfied with high probability. 

The news get even better to say that all the previously stated philosophy 
can be extended to the more general type of signals, which are not, necessar- 
ily, sparse or sparsely represented in terms of the atoms of a dictionary, and 
they are known as compressible. A signal vector is said to be compressible 
if its expansion in terms of a basis consists of just a few large coefficients 6i 
and the rest are small. In other words, the signal vector is approximately 
sparse in some basis. Obviously, this is the most interesting case in practice, 
where exact sparsity is scarcely (if ever) met. Reformulating the arguments 
used in Section [9j the CS task for this case can be cast as: 

min 

s.t. \\y-Xm\\l < e, (46) 

and everything that has been said in Section [9] is also valid for this case, if 
in place of A we consider the product X^>. 



Remarks 9. 
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x(t) 



Correlator: 
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rL 

Figure 11: Sampling an analogue signal s(t) in order to generate the mea- 
surement y n at the time instant n. The sampling period T s is much lower 
than that required by the Nyquist sampling. 



• An important property in compressed sensing is that the sensing ma- 
trix, which provides the measurements, may be chosen independently 
on the matrix fy; that is, the basis/dictionary in which the signal 
is sparsely represented. In other words, the sensing matrix can be 
"universal" and can be used to provide the measurements for recon- 
structing any sparse or sparsely represented signal in any dictionary, 
provided RIP is not violated. 

• Each measurement, y n , is the result of an inner product (projection) 
of the signal vector with a row, x^, of the sensing matrix, X. Assum- 
ing that the signal vector, s, is the result of a sampling process on 
an analogue signal, s(t), then y n can be directly obtained, to a good 
approximation, by taking the inner product (integral) of s(t) with a 
sensing waveform, x n (t), that corresponds to x n . For example, if X 
is formed by ±1, as described in Section [8.2.1 then the configuration 



shown in Fig. 11 can result to y n . An important aspect of this ap- 
proach, besides avoiding to compute and store the I components of s, 
is that multiplying by ±1 is a relatively easy operation. It is equivalent 
with changing the polarity of the signal and it can be implemented by 
employing inverters and mixers. It is a process that can be performed, 
in practice, at much higher rates than sampling (we will come to it 
soon). If such a scenario is adopted, one could obtain measurements 
of an analogue signal at much lower rates than required for classical 
sampling, since iV is much lower than I. The only condition is that the 
signal vector must be sparse, in some dictionary, which may not neces- 
sarily be known during the data acquisition phase. Knowledge of the 
dictionary is required only during the reconstruction of s. Thus, in the 
CS rationale, processing complexity is removed from the "front end" 
and is transferred to the "back end" , by exploiting l\ optimization. 

One of the very first applications that were inspired by the previous 



approach, is the so-called one pixel camera Takh 06 . This was one 
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among the most catalytic examples, that spread the rumour about the 
practical power of CS. CS is an example of what is commonly said: 
"There is nothing more practical than a good theory" ! 

10.0.3 Dimensionality Reduction and Stable Embeddings 

We are now going to shed light to what we have said so far in this paper 
from a different view. In both cases, either when the unknown quantity 
was a fc-sparse vector in a high dimensional space, TZ l , or if the signal s 
was (approximately) sparsely represented in some dictionary (s = ^>9), we 
chose to work in a lower dimensional space (1Z N ); that is, the space of the 
measurements, y. This is a typical task of dimensionality reduction. The 
main task in any (linear) dimensionality reduction technique is to choose 
the proper matrix X, that dictates the projection to the lower dimensional 
space. In general, there is always a loss of information by projecting from 1Z 
to 1Z N , with N < I, in the sense that we cannot recover any vector, Q\ ElZ 1 , 
from its projection 6^ G 1Z N ■ Indeed, take any vector Oi-n £ null(X), that 
lies in the (I — iV)-dimensional null space of the (full rank) X (see Section 
£5J). Then, all vectors 0\ + Q\-n G 7Z l share the same projection in 7Z N . 
However, what we have discovered in our tour in this paper is that if the 
original vector is sparse then we can recover it exactly. This is because all 
the fc-sparse vectors do not lie anywhere in 1Z l , but rather in a subset of 
it; that is, in the union of subspaces, each one having dimensionality k. If 
the signal s is sparse in some dictionary ^, then one has to search for it in 
the union of all possible A:-dimensional subspaces of 1Z l , which are spanned 



by k column vectors from ^, Bara 10a, Lu 08a . Of course, even in this 



case, where sparse vectors are involved, not any projection can guarantee 
unique recovery. The guarantee is provided if the projection in the lower 
dimensional space is a stable embedding. A stable embedding in a lower 
dimensional space must guarantee that if 6\ ^ 62, then their projections 
remain also different. Yet this is not enough. A stable embedding must 
guarantee that distances are (approximately) preserved; that is, vectors that 
lie far apart in the high dimensional space, have projections that also lie far 
apart. Such a property guarantees robustness to noise. Well, the sufficient 
conditions, which have been derived and discussed throughout this paper, 
and guarantee the recovery of a sparse vector lying in TZ l from its projections 
in 1Z N , are conditions that guarantee stable embeddings. The RIP and 
the associated bound on iV provides a condition on X that leads to stable 
embeddings. We commented on this norm-preserving property of RIP in the 
related section. The interesting fact that came out from the theory is that 
we can achieve such stable embeddings via random projection matrices. 

Random projections for dimensionality reduction are not new and have 
extensively been used in pattern recognition, clustering and data mining, 
see, e.g., |Achl 01 , Blum 06 Dasg 00 Theo 09 . More recently, the spirit 



10. COMPRESSED SENSING: THE GLORY OF RANDOMNESS 



40 



underlying compressed sensing has been exploited in the context of pattern 
recognition, too. In this application, one needs not to return to the original 
high dimensional space, after the information-digging activity in the low di- 
mensional measurements subspace. Since the focus in pattern recognition 
is to identify the class of an object/pattern, this can be performed in the 
measurements subspace, provided that there is no class-related information 



loss. In Cald 091, it is shown, using compressed sensing arguments, that 



if the data is approximately linearly separable in the original high dimen- 
sional space and the data has a sparse representation, even in an unknown 
basis, then projecting randomly in the measurements subspace retains the 
structure of linear separability. 

Manifold learning is another area where random projections have been 
recently applied. A manifold is, in general, a nonlinear A;- dimensional sur- 
face, embedded in a higher dimensional (ambient) space. For example, the 
surface of a sphere is a two-dimensional manifold in a three-dimensional 
space. More on linear and nonlinear techniques for manifold learning can 



be found in, e.g., Theo 09 . In [Waki 08,Bara 09 , the compressed sensing 



rationale is extended to signal vectors that live along a fc-dimensional sub- 
manifold of the space 1Z 1 . It is shown that choosing a matrix, X, to project 
and a sufficient number, N, of measurements, then the corresponding sub- 
manifold has a stable embedding in the measurements subspace, under the 
projection matrix, X; that is, pairwise Euclidean and geodesic distances are 
approximately preserved after the projection mapping. More on these issues 



can be found in the given references and in, e.g., [Bara 10a 



10.0.4 Sub-Nyquist Sampling: Analog-to-Information Conversion 

In our discussion in the Remarks presented before, we touched a very im- 
portant issue; that of going from the analogue domain to the discrete one. 
The topic of analog-to-digital (A/D) conversion has been at the forefront 
of research and technology since the seminal works of Shannon, Nyquist, 



Whittaker and Kotelnikof were published, see, for example, Unse 00 for a 
thorough related review. We all know that if the highest frequency of an 
analog signal, s(t), is less than F/2, then Shannon's theorem suggests that 
no loss of information is achieved if the signal is sampled, at least, at the 
Nyquist rate of F = 1/T, where T is the corresponding sampling period, 
and the signal can be perfectly recovered by its samples 

s(t) = s(nT) sinc(Ft — n), 

n 

where sinc(-) is the sampling function 

. , sin(7r£) 
sinc(i) = — — . 



10. COMPRESSED SENSING: THE GLORY OF RANDOMNESS 



41 



While this has been the driving force behind the development of signal 
acquisition devices, the increasing complexity of emerging applications de- 
mands increasingly higher sampling rates, that cannot be accommodated 
by today's hardware technology. This is the case, for example, in wideband 
communications, where conversion speeds, as dictated by Shannon's bound, 
have become more and more difficult to obtain. Consequently, alternatives 
to high rate sampling are attracting a strong interest with the goal to re- 
duce the sampling rate by exploiting the underlying structure of the signals 
at hand. In many applications, the signal comprises a few frequencies or 
bands, see Fig. 12 for an illustration. In such cases, sampling at the Nyquist 
rate is inefficient. This is an old problem and it has been addressed by a 
number of authors, in the case where the locations of the non-zero bands 
in the frequency spectrum are known, see, e.g., Vaug 91, Lin 98 Venk 00 



CS theory has inspired research to study cases where the locations (carrier 
frequencies) of the bands are not known a- priori. A typical application of 
this kind, of high practical interest, lies within the field of Cognitive radio, 

In contrast to what we have studied so far in 



Yu 08,Tian 07,Mish 10 



e.g., 

this paper, the sparsity now characterizes the analog signal, and this poses 
a number of challenges that need to be addressed. In other words, one can 
consider that 

s (t) = Y, OiMt), 

where ipi(t), i £ I, are the functions that comprise the dictionary, and only 
a small subset of the coefficients 9i are nonzero. Note that although each 
one of the dictionary functions can be of high bandwidth, the true number of 
degrees of freedom of the signal is low. Hence, one would like to sample the 
signal not at the Nyquist rate but at a rate determined by the sparsity level 
of the coefficients' set. We refer to such a scenario as Analog-to-Information 
sampling or sub-Nyquist sampling. 

An approach inspired directly by the theory of CS was first presented 
Kiro 06 and later improved and theoretically developed in |Trop 10" 



m 



The approach builds around the assumption that the signal consists of a 
sum of sinusoids and the random demodulator of Figure 11 is adopted. In 
Mish 10 Mish lib , the more general case of a signal consisting of a number 
of frequency bands, instead of tones, was treated. In addition, the task 
of extracting each band of the signal from the compressed measurements, 
that enables (low rate) baseband processing, is addressed. In principle, 
CS related theory would enable far fewer data samples than traditionally 
required when capturing signals with relatively high bandwidth, but with 
a low information rate. However, from a practical point of view, there are 
still a number of hardware implementation related issues, such as random 
jittering, to be solved first, e.g., Chen 12 Maec 12] . 

An alternative path to sub-Nyquist sampling embraces a different class 
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Figure 12: The Fourier transform of an analogue signal, s(t), which is sparse 
in the frequency domain; only a limited number of frequency bands con- 
tribute to its spectrum content S(£l), where O stands for the angular fre- 
quency. Nyquist's theory guarantees that sampling at a frequency larger 
than or equal to twice the maximum max is sufficient to recover the orig- 
inal analogue signal. However, this theory does not exploit information 
related to the sparse structure of the signal in the frequency domain. 



of analog signals known as multipulse signals; that is, signals that consist 
of a stream of short pulses. Sparsity now refers to the time domain, and 
such signals may not even be bandlimited. Signals of this type can be 
met in a number of applications, such as in radar, ultrasound, bioimaging 

An approach, known 



Drag 07 



and neuronal signal processing, see, e.g. 
as finite rate of innovation sampling, passes an analogue signal, having k 
degrees of freedom per second, through a linear time invariant filter and then 
samples at a rate of 2k samples per second. Reconstruction is performed via 
rooting a high-order polynomial, see, e.g., Vett 02[[Blu 08| and the references 
therein. In Matu 12| , the task of sub-Nyquist sampling is treated using CS 
theory arguments and an expansion in terms of Gabor functions; the signal 
is assumed to consist of a sum of a few pulses of finite duration, yet of 
unknown shape and time positions. 

The task of sparsity-aware learning in the analogue domain is still in its 
early stages and there is currently a lot of on-going activity; more on this 



topic can be obtained in Duar 11 Mish 11a and the references there in. 



11 Sparsity-Promoting Algorithms 

In the previous sections, our emphasis was to highlight the most important 
aspects underlying the theory of sparse signal/vector recovery from an un- 
derdetermined set of linear equations. We now turn our attention to the 
algorithmic aspects of the problem Elad 10 . The issue now becomes that 



of discussing efficient algorithmic schemes, which can achieve the recovery 
of the unknown set of parameters. In Sections [4] and [6j we saw that the 
constrained l\ norm minimization (Basis Pursuit) can be solved via Lin- 
ear Programming techniques and the LASSO task via convex optimization 
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schemes. However, such general purpose techniques tend to be inefficient, 
since, often, they require many iterations to converge and the respective 
computational resources can be excessive for practical applications, espe- 
cially in high dimensional spaces, K l . As a consequence, a huge research 
effort has been invested with the goal to develop efficient algorithms, that 
are tailored-made to these specific tasks. This is still a hot on-going area 
of research and definite conclusions are still risky to be drawn. Our aim 
here is to provide the reader with some general trends and philosophies that 
characterize the related activity. We will focus on the most commonly used 
and cited algorithms, which at the same time are structurally simple and the 
reader can follow them, without requiring a deeper knowledge on optimiza- 
tion. Moreover, these algorithms involve, in one way or another, arguments 
that are directly related to points and notions that we have already used 
while presenting the theory; thus, they can also be exploited from a pedagog- 
ical point of view, in order to strengthen the reader's understanding of the 
topic. We start our review with the class of batch algorithms, where all data 
are assumed to be available prior to the application of the algorithm, and 
then we will move on to online/time- adaptive schemes. Furthermore, our 
emphasis is on algorithms that are appropriate for any sensing matrix. This 
is stated in order to point out that in the literature efficient algorithms have 
also been developed for specific forms of highly structured sensing matri- 
ces; exploiting their particular structure can lead to reduced computational 
demands, |Gilb 05[|Need~09~ 



There are currently three rough types of families along which this al- 
gorithmic activity is growing: a) greedy algorithms, b) iterative shrinkage 
schemes, and c) convex optimization techniques. We have used the word 
rough, since, in some cases, it may be difficult to assign an algorithm to a 
specific family. 

11.1 Greedy Algorithms 



Greedy algorithms have a long history, see, for example, Teml 03 for a com- 



prehensive list of references. In the context of dictionary learning, a greedy 



algorithm known as Matching Pursuit was introduced in [Mall 93 . A greedy 



algorithm is built upon a series of locally optimal single-term updates. In 
our context, the goals are: a) to unveil the "active" columns of the sensing 
matrix X; that is, those columns that correspond to the nonzero locations of 
the unknown parameters and b) to estimate the respective sparse parameter 
vector. The set of indices which correspond to the nonzero vector compo- 
nents is also known as the support. To this end, the set of active columns of X 
(and the support) is increased by one at each iteration step. In the sequel, an 
updated estimate of the unknown sparse vector is obtained. Let us assume 
that, at the (i — l)th iteration step, the algorithm has selected the columns 
denoted as Xj a , Xj 2 , . . . , , with ji, j2, ■ • • , ji-i £ {1,2,...,/}. These in- 
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dices are the elements of the currently available support, S^ 1 ^. Let X^ 1 ^ 
be the N x (i — 1) matrix having Xj 17 Xj 2 , . . . ,x J - i _ 1 as its columns. Let, 
also, the current estimate of the solution be 6^ % ~ l \ which is a (i — l)-sparse 
vector, with zeros at all locations with index outside the support. 

Algorithm 1 (Orthogonal Matching Pursuit (OMP)). 

The algorithm is initialized with 9^ := 0, e(°) := y and 5 (0) := 0. At 
iteration step i, the following computational steps are performed: 

1. Select the column Xj i of X, which is maximally correlated to (forms 
the least angle with) the respective error vector, e^ _1 ) := y — X6^ l ~ l \ 
i.e., 

Xj, : ji ■= argmax j=12 >; — — . 

Il x i lb 

2. Update the support and the corresponding set of active columns: 
S (i) = S (i-i) u ^.} ; and X {i) = [x(*-i), Xii ]. 

3. Update the estimate of the parameter vector: Solve the Least-Squares 
(LS) problem that minimizes the norm of the error, using the active 
columns of X only, i.e., 

§ := arg mm ze1l i 

Obtain 9^ by inserting the elements of 6 in the respective locations 
(ji,j2, . . . ,ji), which comprise the support (the rest of the elements of 
(?W retain their zero values). 

4. Update the error vector 

eW :=y-X0®. 



A«; 



The algorithm terminates if the norm of the error becomes less than 
a preselected user-defined constant, eo- The following observations are in 
order. 

Remarks 10. 

• Since 6^ % \ in Step 3, is the result of a LS task, it is known that the error 
vector is orthogonal to the subspace spanned by the active columns 
involved, i.e., 

eW_Lspan{xj 1 , . . . ,XjJ . 

This guarantees that taking the correlation, in the next step, of the 
columns of X with none of the previously selected columns will 
be reselected; they result to zero correlation, being orthogonal to e®, 
see Fig. [13 
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,(2) 




Figure 13: The error vector at the ith iteration is orthogonal to the sub- 
space spanned by the currently available set of active columns. Here is an 
illustration for the case of the 3-dimensional Euclidean space and for 
i = 2. 



• It can be shown that the column, which has maximal correlation (max- 
imum absolute value of the inner product) with the currently available 
error vector, is the one that maximally reduces (compared to any other 
column) the £2 norm of the error, when y is approximated by linearly 
combining the currently available active columns. This is the point 
where the heart of the greedy strategy beats. This minimization is 
with respect to a single term, keeping the rest fixed, as they have been 



obtained from the previous iteration steps Bruc 09|. 



Starting with all the components being zero, if the algorithm stops 
after ko iteration steps, the result will be a fco-sparse solution. 

Note that there is no optimality in this searching strategy. The only 
guarantee is that the £2 norm of the error vector is decreased at every 
iteration step. In general, there is no guarantee that the algorithm can 
obtain a solution close to the true one, see, e.g., DeVo 96 . However, 



under certain constraints on the structure of X, performance bounds 
can be obtained, see, e.g., [Trop 04[|Dave 10[|Zhan~TT] . 



• The complexity of the algorithm amounts to 0(kolN) operations, 
which are contributed by the computations of the correlations, plus 
the demands raised by the solution of the LS task, in Step [3| whose 
complexity depends on the specific algorithm used. The ko is the 
sparsity level of the delivered solution and, hence, the total number of 
iteration steps that are performed. 

Another more qualitative argument, that justifies the selection of the 
columns based on their correlation with the error vector, is the following. 
Assume that the matrix X is orthonormal. Let also y = XO. Then, y 
lies in the subspace spanned by the active columns of X, i.e., those which 
correspond to the non-zero components of 0. Hence, the rest of the columns 
are orthogonal to y, since X is assumed to be orthonormal. Taking the 
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correlation of y, at the first iteration step, with all the columns, it is certain 
that one among the active columns will be chosen. The inactive columns 
result in zero correlation. A similar argument holds true for all subsequent 
steps, since all the activity takes place in a subspace that is orthogonal to 
all the inactive columns of X. In the more general case, where X is not 
orthonormal, we can still use the correlation as a measure that quantifies 
geometric similarity. The smaller the correlation/the magnitude of the inner 
product is, the more orthogonal two vectors are. This brings us back to the 
notion of mutual coherence, which is a measure of the maximum correlation 
(least angle) among the columns of X. 



11.1.1 OMP Can Recover Optimal Sparse Solutions: Sufficiency 
Condition 

We have already stated that, in general, there are no guarantees that OMP 
will recover optimal solutions. However, when the unknown vector is suffi- 
ciently sparse, with respect to the structure of the sensing matrix X, then 
OMP can exactly solve the £q minimization task in (27) and recover the solu- 



tion in ko steps, where ko is the sparsest solution that satisfies the associated 
linear set of equations. 



Theorem 5. Let the mutual coherence (Section 7.1) of the sensing matrix, 
X, be n(X). Assume, also, that the linear system, y = X6, accepts a 
solution such as 

l|8|| ° < k( 1+ m)- (47) 

Then, OMP guarantees to recover the sparsest solution in ko = \\B\\ steps. 



We know from Section 7.1 that, under the previous condition, any other 



solution will be necessarily less sparse. Hence, there is a unique way to 
represent y in terms of ko columns of X. Without harming generality, let 
us assume that the true support corresponds to the first ko columns of X, 
i.e., 

ko 

y = ^2e jXj , %/0,VjG{l,...,A; Q }. 
The theorem is a direct consequence of the following proposition: 



Proposition 1. If the condition (47) holds true, then the OMP algorithm 
will never select a column with index outside the true support, see, e.g., 
|Trop 04| . In a more formal way, this is expressed as 



Ji = argmax j=12 ^ 



x J c (i-l) 



X 



e {!,..., M- 



J 112 
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(a) (b) 



Figure 14: (a) In the case of an orthogonal matrix, the measurement vector 
y will be orthogonal to any inactive column; here, X3. (b) In the more 
general case, it is expected to "lean" closer (form smaller angles) to the 
active than to the inactive columns. 



A geometric interpretation of this proposition is the following: if the 
angles formed between all the possible pairs among the columns of X are 
large enough in the 1Z space, which guarantees that n{X) is small enough, 
then y will lean more (form smaller angle) towards any one of the active 
columns, which contribute to its formation, compared to the rest, which are 
inactive and do not participate in the linear combination that generates y. 
Fig. [14] illustrates the geometry, for the extreme case of mutually orthogonal 



vectors (Fig. 14 1) and for the more general case, where the vectors are 
not orthogonal, yet the angle between any pair of columns is large enough 

(Fig.[g>). 

In a nutshell, the previous proposition guarantees that, during the first 
iteration, a column corresponding to the true support will be selected. In 
a similar way, this is also true for all subsequent iterations. In the second 
step, another, different from the previously selected column (as it has already 
been stated), will be chosen. At step ko, the last remaining active column, 
corresponding to the true support, is selected and this necessarily results to 
zero error. To this end, it suffices to set eo equal to zero. 

11.1.2 The LARS Algorithm 



The Least Angle Regression (LARS) algorithm, Efro 04 , shares the first 
two steps with OMP. It selects ji to be an index outside the currently avail- 
able active set so that to maximize the correlation with the residual vector. 
However, instead of performing an LS fit to compute the nonzero compo- 
nents of 0W, these are computed so that the residual to be equicorrelated 
with all the columns in the active set, i.e., 

|xj(y - X0®)\ = constant, Vj e , 

where we have assumed that the columns of X are normalized, as it is 
common in practice (recall, also, the Remarks]!]). In other words, in contrast 
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to the OMP, where the error vector is forced to be orthogonal to the active 
columns, LARS demands this error to form equal angles with each one of 
them. Likewise OMP, it can be shown that, provided the target vector is 
sufficiently sparse and under incoherence of the columns of X, LARS can 



exactly recover the sparsest solution, Tsai 07 



A further small modification leads to the so-called LARS-LASSO algo- 
rithm. According to this version, a previously selected index in the active 
set can be removed at a later stage. This gives the algorithm the potential 
to "recover" from a previously bad decision. Hence, this modification de- 
parts from the strict rationale that defines the greedy algorithms. It turns 
out that this version solves the LASSO optimization task. This algorithm 
is the same as the one suggested in [Osbo 00] and it is known as homotopy 
algorithm. Homotopy methods are based on a continuous transformation 
from one optimization task to another. The solutions to this sequence of 
tasks lie along a continuous parameterized path. The idea is that, while the 
optimization tasks may be difficult to solve by themselves, one can trace this 
path of solutions by slowly varying the parameters. For the LASSO task, it 



is the A parameter which is varying, see, e.g., Plum 05, Mali 05,Asif 10a 



Take as an example the LASSO task in its regularized version in (13). For 
A = 0, the task minimizes the £2 norm and for A — > 00 the task minimizes 
the t\ norm, and for this case the solution tends to zero. It turns out that 
the solution path, as A changes from large to small values, is polygonal. Ver- 
tices on this solution path correspond to vectors having nonzero elements 
only on a subset of entries. This subset remains unchanged, till A reaches 
the next critical value, which corresponds to a new vertex of the polygonal 
path and to a new subset of potential nonzero values. Thus, the solution is 
obtained via this sequence of steps along this polygonal path. 



11.1.3 Compressed Sensing Matching Pursuit (CSMP) Algorithms 

Strictly speaking, these algorithms are not greedy, yet, as it is stated in 
|Need 09] , they are at heart greedy algorithms. Instead of performing a 
single term optimization per iteration step, in order to increase the support 
by one, as it is the case with OMP, these algorithms attempt to obtain first 
an estimate of the support and then use this information to compute a least 
squares estimate of the target vector, constrained on the respective active 
columns. The quintessence of the method lies in the near-orthogonal nature 
of the sensing matrix, assuming that this obeys the RIP. 

Assume that X obeys the RIP for some small enough value 5k and spar- 
sity level, k, of the unknown vector. Let, also, that the measurements are 
exact, i.e., y = XO. Then, X T y = X T X6 ~ 6. Therefore, intuition indi- 
cates that it is not unreasonable to select, in the first iteration step, the t 
(a user-defined parameter) largest in magnitude components of X T y as in- 
dicative of the nonzero positions of the sparse target vector. This reasoning 
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carries on for all subsequent steps, where, at the ith iteration, the place of 
y is taken by the residual e^ -1 ) := y — XQ^ -1 ', where indicates the 

estimate of the target vector at the (i — l)th iteration. Basically, this could 
be considered as a generalization of the OMP. However, as we will soon see, 
the difference between the two mechanisms is more substantial. 

Algorithm 2 (The CSMP Scheme). 

1. Select the value of t. 

2. Initialize the algorithm: 0(°) = 0, e(°) = y. 

3. For i = 1,2,..., execute the following. 

(a) Obtain the current support: 



s (i) 



( ~ti iA f indices of the t largest in magnitude ] 
:=supp lev- 1 * )U< i _ c vT fv \ 

\ ) y components of X 1 e 1 - 1 L > J 



(b) Select the active columns: Construct X0 to comprise the active 
columns of X in accordance to Obviously, X" is a N x r 
matrix, where r denotes the cardinality of the support set S^' . 

(c) Update the estimate of the parameter vector: solve the LS task 



6 := arg max zenr 



y 



2 



2 



Obtain 0W G 7Z l having the r elements of 6 in the respective lo- 
cations, as indicated by the support, and the rest of the elements 
being zero. 

(d) 00 := H k (00). The mapping denotes the hard thresholding 
operator; that is, it returns a vector with the k largest in mag- 
nitude components of the argument, and the rest are forced to 
zero. 

(e) Update the error vector: e0 = y — X00 . 

The algorithm requires as input the sparsity level k. Iterations carry on 
until a halting criterion is met. The value of t, that determines the largest 
in magnitude values in Steps [T] and 3a, depends on the specific algorithm. 



In CoSaMP (Compressive Sampling Matching Pursuit, |Need 09] ), t = 2k 
and in the SP (Subspace Pursuit, |Dai 09]), t = k. 

Having stated the general scheme, a major difference with OMP becomes 
readily apparent. In OMP, only one column is selected per iteration step. 
Moreover, this remains in the active set for all subsequent steps. If, for 
some reason, this was not a good choice, the scheme cannot recover from 
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such a bad decision. In contrast, the support and hence the active columns 
of X are continuously updated in CSMP and the algorithm has the ability 
to correct a previously bad decision, as more information is accumulated 



and iterations progress. In Dai 09 , it is shown that if the measurements 
are exact (y = X6) then SP can recover the /c-sparse true vector in a 
finite number of iteration steps, provided that X satisfies the RIP with 
< 0.205. If the measurements are noisy, performance bounds have been 
derived, which hold true for 63k < 0.083. For the CoSaMP, performance 
bounds have been derived for 5^ < 0.1. 

11.2 Iterative Shrinkage Algorithms (1ST) 

This family of algorithms have also a long history, see, e.g., [Jans 84|Dono 94 



Hoch 90 King 02] . However, in the "early" days, the developed algorithms 
had some sense of heuristic flavor, without establishing a clear bridge with 
optimizing a cost function. Later attempts were substantiated by sound the- 
oretical arguments concerning issues such as convergence and convergence 
rate, e.g., [Figu 03[|Daub 04[|Elad 07a]|Comb 05 



The general form of this algorithmic family has a striking resemblance 
with the classical linear algebra iterative schemes for approximating the so- 
lution of large linear systems of equations, known as stationary iterative or 
iterative relaxation methods. The classical Gauss-Seidel and Jacobi algo- 
rithms, e.g., |Hage 81 , in numerical analysis can be considered as members 



of this family. Given a linear system of I equations with I unknowns, z = Ax, 
the basic iteration at step i has the following form 

x® = (/ - QA) x^ + Qz 

= x^ + Qe^\ :=z- Ax^\ 

which does not come as a surprise. It is of the same form as most of the 
iterative schemes for numerical solutions! The matrix Q is chosen so that 
to guarantee convergence and different choices lead to different algorithms 
with their pros and cons. It turns out that this algorithmic form can also 
be applied to underdetermined systems of equations, y = X6, with a "mi- 
nor" modification, which is imposed by the sparsity constraint of the target 
vector. This leads to the following general form of iterative computation 

ffW = Ti (O^-V + Qe^A , e^ 1 ) = y - X9^~ l \ 



starting from an initial guess of 6^ (usually 0^ = 0, e^ ) = y). In certain 
cases, Q can be made to be iteration-dependent. The operator Tj(-) is a non- 
linear thresholding operator, that is applied entrywise, i.e., component-wise. 
Depending on the specific scheme, this can be either the hard thresholding 
operator, denoted as Hk, or the soft thresholding operator, denoted as S a . 
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Hard thresholding, as we already know, keeps the k largest components of 
a vector unaltered and sets the rest equal to zero. Soft thresholding was 
introduced in Section [4} All components with magnitude less than a are 
forced to zero and the rest are reduced in magnitude by a; that is, the jth 
component of a vector, 6, after soft thresholding becomes 

(S a (O)) j = Bga(0 j )(\9 j \-a) + . 

Depending on a) the choice of Tj, b) the specific value of the parameter k 
or a and c) the matrix Q, different instances occur. A most common choice 
for Q is \xX T and the generic form of the main iteration becomes 

ffW = Ti (O^-V + fjO^e^A , (48) 



where fi is a relaxation (user-defined) parameter, which can also be left to 
vary with each iteration step. The choice of X T is intuitively justified, once 
more, by the near-orthogonal nature of X. For the first iteration step and 
for a linear system of the form y = XO, starting from a zero initial guess, 
we have X T y = X T X6 ~ and we are close to the solution. 

Although intuition is most important in scientific research, it is not 
enough, by itself, to justify decisions and actions. The generic scheme in 



(48) has been reached from different paths, following different perspectives 
that lead to different choices of the respective parameters. Let us spend 
some more time on that, with the aim to make the reader more familiar 
with techniques that address optimization tasks of non-differentiable loss 



functions. The term in the parenthesis in (48) coincides with the gradient 
descent iteration step if the cost function were the unregularized LS loss, 
i.e., 

J{6)= l -\\y-Xe\\l. 
In this case, the gradient descent rationale leads to 

_ tl 9J S^3 = fl (i-D _ ^X T (X0^ - y) 

= 0( i - 1 )+ At X T e( 1 '- 1 ). 

It is well known and it can easily be shown that the gradient descent can 
alternatively be viewed as the result of minimizing a regularized version of 
the linearized loss function, i.e., 

0« = argmin^ { J (*C*-U) + (e - ^ 



1 

+ 2/z 



e - e^-v 



dO 

2" 



(49) 
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One can adopt this view of the gradient descent philosophy as a kick-off 
point to minimize iteratively the following LASSO task, i.e., 

min |l(0,A) = ~ \\y - X0\\\ + A {{0^ = J{9) + \\\0\\-\ . 

The difference now is that the loss function comprises two terms. One which 
is smooth (differentiable) and a non-smooth one. Let the current estimate 
be 6^ l ~ l \ The updated estimate is obtained by 



00 



ar 



gmin^jj^- 1 )) +{e-e^ 



+ 



2/t 



86 

+ x\\e\\ 1 



e - 



which, after ignoring constants, becomes 



00 



where 



argmin 0eTCi 



:= 0^ 



6-9 



+ Vll^lli 



80 



(50) 
(51) 



Following exactly the same steps as those that led to the derivation of (20) 

i 

dJ{G^- l ))\ 



from (13) (after replacing 6l$ with 6) we obtain 
00 



Sx»(0) = S Xfl 6^ 



fi- 



ae 



s x Je^+»x T e^ 



(52) 
(53) 



This is very interesting and practically useful. The only effect of the pres- 
ence of the non-smooth i\ norm in the loss function is an extra simple 
thresholding operation, which as we know is an operation performed indi- 
vidually on each component. It can be shown, e.g., Beck 09 , that this 
algorithm converges to a minimizer 6* of the LASSO (13), provided that 
H G (0, 1 / X max (X T X)) , where A max (-) denotes the maximum eigenvalue of 
X T X. The convergence rate is dictated by the rule 

L(0«,A)-L(0*,A) « 0(l/i), 

which is known as sublinear global rate of convergence. Moreover, it can be 
shown that 

L(0«,A) -L(0*,A) < -L ^. 
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The latter result indicates that if one wants to achieve an accuracy of e, then 



this can be obtained by at most 
the floor operator. 



C||0(°)-0, 
2i 



iterations, where [-J denotes 



In Daub 04 , (48) was obtained from a nearby corner, building upon 



arguments from the classical proximal-point methods in optimization theory, 
e.g., |Rock 76] . The original LASSO regularized cost function is modified to 
the so called surrogate objective, 

J(0,0) = ±\\y-X9\\l + \\\0\\ 1 + ~d{0,0) 1 



where 



d{9,9) 



9-9 



X9-X9 



If c is appropriately chosen (larger than the largest eigenvalue of X T X), 
the surrogate objective is guaranteed to be strictly convex. Then it can be 
shown that the minimizer of the surrogate objective is given by 



9 = S 



X/c 



1 



9 + -X 1 (y 

c 



X9 



(54) 



In the iterative formulation, 9 is selected to be the previously obtained es- 
timate; in this way, one tries to keep the new estimate close to the previous 



one. The procedure readily results to our generic scheme in (48), using soft 



thresholding with parameter X/c. It can be shown that such a strategy con- 
verges to a minimizer of the original LASSO problem. The same algorithm 
was reached in Figu 03 , using majorization-minimization techniques from 
optimization theory. So, from this perspective, the 1ST family has strong 
ties with algorithms that belong to the convex optimization category 

In IWrig 09b , the Sparse Reconstruction by Separable Approximation 
(SpaRSA) algorithm is proposed, which is a modification of the standard 
1ST scheme. The starting point is (49); however, the multiplying factor, 
2—, instead of being constant is now allowed to change from iteration to 
iteration according to a rule. This results in a speed up in the convergence 
of the algorithm. Moreover, inspired by the homotopy family of algorithms, 
where A is allowed to vary, SpaRSA can be extended to solve a sequence of 
problems which are associated with a corresponding sequence of values of A. 
Once a solution has been obtained for a particular value of A, it can be used 
as a "warm-start" for a nearby value. Solutions can therefore be computed 
for a range of values, at a small extra computational cost, compared to 
solving for a single value from a "cold start". This technique abides with 
to the so-called continuation strategy, which has been used in the context 
of other algorithms as well, e.g., |Hale 07 . Continuation has been shown to 
be a very successful tool to increase the speed of convergence. 
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An interesting modification of the basic 1ST scheme has been proposed in 
Beck 09 , which improves the convergence rate to 0(l/i 2 ), by only a simple 



modification with almost no extra computational burden. The scheme is 
known as Fast Iterative Shrinkage- Thresholding Algorithm (FISTA). This 
scheme is an evolution of Nest 83 , which introduced the basic idea for the 



case of differentiable costs, and consists of the following steps: 



where 



0(0 



S A fj, 

0(0 



y 



Xz^ 



+ 



ti+i 



0(0 _ e <i-i)) ; 



1 + + 4t? 



with initial points t\ = 1 and = 6^°\ In words, in the thresholding oper- 
ation, 0( 1-1 ) is replaced by z®, which is a specific linear combination of two 
successive updates of 0. Hence, at a marginal increase of the computational 
cost, a substantial increase in convergence speed is achieved. 

In Blum 09a the hard thresholding version has been used, with \l = 1 



and the thresholding operator uses the sparsity level k of the target 
solution, that is assumed to be known. In a later version, |Blum 10 , the 



relaxation parameter is left to change so that, at each iterations step, the 
error is maximally reduced. It has been shown that the algorithm converges 
to a local minimum of the cost function \\y — X0\\ 2 , under the constraint 
that 6 is a fc-sparse vector. Moreover, the latter version is a stable one and 
it results to a near optimal solution if a form of RIP is fulfilled. 

A modified version of the generic scheme given in (48), that evolves 



along the lines of |Luo 92 , obtains the updates component-wise, one vector 
component at a time. Thus, a "full" iteration consists of I steps. The 
algorithm is known as coordinate descent and its basic iteration has the 
form, 

-!)\ 



3 (0 




1,2,. ..,1. 



(55) 



This algorithm replaces the constant c, in the previously reported soft 
thresholding algorithm, with the norm of the respective column of A, if 
the columns of X are not normalized to unit norm. It has been shown that 
the parallel coordinate descent algorithm also converges to a LASSO mini- 
mizer of (13), lElad 07a . Improvements of the algorithm, using line search 



techniques to determine the most descent direction for each iteration, have 



also been proposed, see, Zibu 10 



The main contribution to the complexity for the iterative shrinkage algo- 
rithmic family comes from the two matrix- vector products, which amounts 
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to O(Nl), unless X has a special structure, e.g., DFT, that can be exploited 
to reduce the load. 



In Male 10|, the so-called Two Stage Thresholding (TST) scheme is pre- 



sented, which brings together arguments from the iterative shrinkage family 
and the OMP. This algorithmic scheme involves two stages of thresholding. 



The first step is exactly the same as in (48). However, this is now used 
only for determining "significant" nonzero locations, just as in Compressed 
Sensing Matching Pursuit (CSMP) algorithms, presented in the previous 
subsection. Then, a LS problem is solved to provide the updated estimate, 
under the constraint of the available support. This is followed by a second 
step of thresholding. The thresholding operations in the two stages can be 
different. If hard thresholding, H^, is used in both steps, this results to 



the algorithm proposed in Fouc 11 . For this latter scheme, convergence 
and performance bounds are derived if the RIP holds for 6^k < 0.58. In 
other words, the basic difference between the TST and CSMP approaches 
is that, in the latter case, the most significant non-zero coefficients are ob- 
tained by looking at the correlation term X T e^ 1 ^ and in the TST family at 
O^-V + f iX T e ( - i - 1 \ The differences among different approaches can be mi- 
nor and the crossing lines among the different algorithmic categories are not 
necessarily crispy clear. However, from a practical point of view, sometimes 
small differences may lead to substantially improved performance. 

Remarks 11. 



The minimization in (52) bridges the 1ST algorithmic family with an- 



other powerful tool in convex optimization, which builds around the 



notion of proximal mapping or Moreau envelopes, see, e.g., Rock 76 
Comb 11 . Given a convex function h : TZ — > TZ, and a \x > 0, the 



proximal mapping. Prox„/, :TZ l — > TZ , with respect to h, and of index 
/i, is defined as the (unique) minimizer 

Prox M / l (x) := argmin u67 ^i |/i(it) + ^-\\x — , \/x £ TZ 1 . (56) 

Let us now assume that we want to minimize a convex function, which 
is given as the sum 

f(P) = J(0) + h(0), 

where </(•) is convex and differentiable, and h(-) is also a convex, but 
not necessarily a smooth one. Then it can be shown that the following 
iterations converge to a minimizer of /(•), 

= Prox,, (V 1 ) - ^-^P) , (57) 
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where \x > and it can also be made iteration dependent, i.e., > 0. 
If we now use this scheme to minimize our familiar cost, i.e., 



J(0) + A||0| 



i ■ 



we obtain (52); this is so, because the proximal operator of h(6) := 
A || 6 1| 1 is shown ( |Comb l"T|Comb"05 ) to be identical to the soft thresh- 
olding operator, i.e., 

Pro X?i (0) = S X (0). 

In order to feel more comfortable with this operator, note that if 
h(x) := 0, its proximal operator is equal to x, and in this case (57) 
becomes our familiar gradient descent algorithm. 

All the non-greedy algorithms, which have been discussed so far, have 



been developed to solve the task defined in the formulation (13). This 



is mainly because this is an easier task to solve; once A has been fixed, it 
is an unconstrained optimization task. However, there are algorithms 
which have been developed to solve the alternative formulations. 



The NESTA algorithm has been proposed in [Beck 11 and solves the 



task in its (15) formulation. Adopting this path can have an advan- 



tage since e may be given as an estimate of the uncertainty associated 
with the noise, which can readily be obtained in a number of practical 
applications. In contr ast, selecting a-priori the value for A is more 
intricate. In Chen 98 , the value A = a^sjl In I, where a„ is the noise 
standard deviation, is argued to have certain optimality properties; 
however this argument hinges on the assumption of the orthogonality 
of X. NESTA relies heavily on Nesterov's generic scheme ( [Nest 83] ) , 
hence its name. The original Nesterov's algorithm performs a con- 
strained minimization of a smooth convex function f(6), i.e., 



min f (6) 

oeQ v ' 

where Q is a convex set, and in our case this is associated with the 



quadratic constraint in (15). The algorithm consists of three basic 



steps. The first one is similar with the step in (49), i.e 



w 



arg min 0eQ 



t dj(e^- 



de 



e - o^-v 



(58) 

where L is an upper bound on the Lipschitz coefficient, which the 
gradient of /(•) has to satisfy. The difference with (49) is that the 



minimization is now a constrained one. However, Nesterov has also 
added a second step involving another auxiliary variable, which 
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is computed in a similar way as w^> but the linearised term is now 
replaced by a weighted cumulative gradient, 

dJ (6K fc )) 
80 ' 



k=0 

The effect of this term is to smooth out the "zig-zagging" of the path 
towards the solution, whose effect is to increase significantly the con- 
vergence speed. The final step of the scheme involves an averaging of 
the previously obtained variables, 

00 = ti z® + (1 - U)w® . 

The values of the parameters a^, k = 0, . . . ,i — 1, and U result from 
the theory so that convergence is guaranteed. As it was the case with 
its close relative FISTA, the algorithm enjoys an 0(l/i 2 ) convergence 
rate. In our case, where the function to be minimized, 1 1 1 1 1 , is not 
smooth, NESTA uses a smoothed prox-function of it. Moreover, it 
turns out that close-form updates are obtained for 20 and i«0. If X is 
chosen so that to have orthonormal rows, the complexity per iteration 
is 0(1) plus the computations needed for performing the product X T X, 
which is the most computational thirsty part. However, this complex- 
ity can substantially be reduced if the sensing matrix is chosen to be 
a submatrix of a unitary transform, which admits fast matrix-vector 
product computations, e.g., a subsampled DFT matrix. For example, 
for the case of a subsampled DFT matrix, the complexity amounts to 
0(1) plus the load to perform the two Fast Fourier Transforms (FFT). 
Moreover, the continuation strategy can also be employed to acceler- 



ate convergence. In Beck 11 , it is demonstrated that NESTA exhibits 



good accuracy results, while retaining a complexity that is competitive 



with algorithms developed around the (15) formulation and scales in 
an affordable way for large size problems. Furthermore, NESTA, and 
in general Nesterov's scheme, enjoy a generality that allows their use 
to other optimization tasks as well. 



The task in (14) has been considered in Berg 08] and |Osbo 00 . In the 



former, the algorithm comprises a projection on the l\ ball ||0||i < p 



(see also, Section 13.4) per iteration step. The most computationally 



dominant part of the algorithm consists of matrix-vector products. In 



Osbo 00 , a homotopy algorithm is derived for the same task, where 
now the bound p becomes the homotopy parameter which is left to 
vary. This algorithm is also referred as the LARS-LASSO, as it has 
already been reported before. 
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11.3 Which Algorithm Then: Some Practical Hints 

We have already discussed a number of algorithmic alternatives to obtain 
solutions to the ^o/^i norm minimization task. Our focus was on schemes 
whose computational demands are rather low and they scale well to very 
large problem sizes. We have not touched more expensive methods such 
as interior point methods for solving the l\ convex optimization task. A 



review of such methods is provided in Kim 07 . Interior point methods 
evolve along the Newton-type recursion and their complexity per iteration 
step is at least of the order 0(l 3 ). As it is most often the case, there is a trade 
off. Schemes of higher complexity tend to result in enhanced performance. 
However, such schemes become impractical in problems of large size. Some 
examples of other algorithms, that were not discussed, can be found in 
[Yin 08 , Berg 08, Daub lOWrig 09b . Talking about complexity, it has to 



be pointed out that what really matters at the end is not so much the 
complexity per iteration step but the overall required resources in computer 
time/memory for the algorithm to converge to a solution within a specified 
accuracy. For example, an algorithm may be of low complexity per iteration 
step but it may need an excessive number of iterations to converge. 

Computational load is only one among a number of indices that charac- 
terize the performance of an algorithm. Other performance measures, refer 
to convergence rate, tracking speed (for the adaptive algorithms), and sta- 
bility with respect to the presence of noise and/or finite word length com- 
putations. No doubt, all these performance measures are also of interest 
here, too. However, there is an additional aspect that is of particular im- 
portance when quantifying performance of sparsity-promoting algorithms. 
This is related to the so called under sampling- spar sity tradeoff or the phase 
transition curve. 

One of the major issues, on which we focused in this paper, was to derive 
and present the conditions that guarantee uniqueness of the £q minimiza- 
tion and its equivalence with the l\ minimization task, under an underde- 
termined set of measurements, y = X6, for the recovery of sparse enough 
signals/vectors. While discussing the various algorithms in this section, we 
reported a number of different RIP-related conditions that some of the al- 
gorithms have to satisfy in order to recover the target sparse vector. As a 
matter of fact, it has to be admitted that this was quite confusing, since 
each algorithm had to satisfy its own conditions. In addition, in practice, 
these conditions are not easy to be verified. Although such results are, no 
doubt, important to establish convergence and make us more confident and 
also understand better why and how an algorithm works, one needs further 
experimental evidence in order to establish good performance bounds for 
an algorithm. Moreover, all the conditions that we have dealt with, includ- 
ing coherence and RIP, are sufficient conditions. In practice, it turns out 
that sparse signal recovery is possible with sparsity levels much higher than 
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Figure 15: For any algorithm, the transition between the regions of 100% 
success and of a complete failure is very sharp. For the algorithm corre- 
sponding to the red curve, this transition occurs at higher sparsity values 
and, from this point of view, it is a better algorithm than the one associated 
with the black curve. Also, given a algorithm, the higher the dimensionality 
the higher the sparsity level where this transition occurs, as indicated by 
the two red curves. 



those predicted by the theory, for given iV and I. Hence, proposing a new 
algorithm or selecting an algorithm from an available palette, one has to 
demonstrate experimentally the range of sparsity levels that can be recov- 
ered by the algorithm, as a percentage of the number of measurements and 
the dimensionality. Thus, in order to select an algorithm, one should cast 
his/her vote for the algorithm which, for given I and N, has the potential 
to recover fc-sparse vectors with k being as high as possible, for most of the 
cases, that is, with high probability. 

Fig. 15 illustrates the type of curve that is expected to result in practice. 
The vertical axis is the probability of exact recovery of a target fc-sparse 
vector and the horizontal axis shows the ratio k/N, for a given number of 
measurements, N, and the dimensionality of the ambient space, I. Three 
curves are shown. The red ones correspond to the same algorithm, for 
two different values of the dimensionality, I, and the gray one corresponds 
to another algorithm. Curves of this shape are expected to result from 
experiments of the following set up. Assume that we are given a sparse 
vector, 6q, with k nonzero components in the Z-dimensional space. Using a 
sensing matrix X, we generate N measurements y = XOq. The experiment 
is repeated a number of, say, M times, each time using a different realization 
of the sensing matrix and a different fc-sparse vector. For each instance, 
the algorithm is run to recover the target sparse vector. This is not always 
possible. We count the number, m, of successful recoveries, and compute the 
corresponding percentage of successful recovery (probability), m/M, which 



is plotted on the vertical axis of Fig. 15 The procedure is repeated for a 



different value of k, 1 < k < N. A number of issues now jump into the 
scene: a) How one selects the ensemble of sensing matrices and b) how one 
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selects the ensemble of sparse vectors. There are different scenarios and 
some typical examples are described next. 

1. The N x I sensing matrices X are formed by: 

(a) Different i.i.d. realizations with elements drawn from a Gaussian 
Af(0,l/N). 

(b) Different i.i.d. realizations from the uniform distribution on the 
unit sphere in TZ N , which is also known as the uniform spherical 
ensemble. 

(c) Different i.i.d. realizations with elements drawn from Bernoulli 
type distributions. 

(d) Different i.i.d. realizations of partial Fourier matrices, each time 
using a different set of N rows. 

2. The A:-sparse target vector 6q is formed by selecting the locations of 
at most k nonzero elements randomly, by "tossing a coin" with proba- 
bility p = k/l, and fill the values of the nonzero elements according to 
a statistical distribution, e.g., Gaussian, uniform, double exponential, 
Cauchy. 

Other scenarios are also possible. Some authors set all nonzero values 



to one, |Blum 06 , or to ±1, with the randomness imposed on the choice 
of the sign. It must be stressed out that the performance of an algorithm 
may vary significantly under different experimental scenarios, and this may 
be indicative of the stability of an algorithm. In practice, a user may be 
interested in a specific scenario, which is more representative of the available 
data. 



Looking at Fig. 15 the following conclusions are in order. In all curves, 
there is a sharp transition between two levels. From the 100% success to 
the 0% success. Moreover, the higher the dimensionality, the sharper the 



transition is. This has also been shown theoretically in Dono 05 . For 
the algorithm corresponding to the red curves, this transition occurs at 
higher values of k, compared to the algorithm that generates the curve 
drawn in gray. Provided that the computational complexity of the "red" 
algorithm can be accommodated by the resources, which are available for 
a specific application, this seems to be the more sensible choice between 
the two algorithms. However, if the resources are limited, concessions are 
unavoidable. 

Another way to "interrogate" and demonstrate the performance of an 
algorithm, with respect to its robustness to the range of values of sparsity 
levels that can be successfully recovered, is via the so-called phase transition 
curve. To this end define: 

• a := y, which is a normalized measure of the problem indeterminacy. 
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Figure 16: Typical phase transition behavior of a sparsity promoting algo- 
rithm. Black corresponds to 100% success of recovering the sparsest solution 
and red to 0%. For high dimensional spaces, the transition is very sharp, as 
it is the case in the figure. For lower dimensionality spaces, the transition 
from black to red is smoother and involves a region of varying color intensity. 



• /3 := jt , which is a normalized measure of sparsity. 

In the sequel, plot a graph having a S [0, 1] in the horizontal axis and 
(3 G [0, 1] in the vertical one. For each point, (a, (3), in the [0, 1] x [0, 1] region, 
compute the probability of the algorithm to recover a /c-sparse target vector. 
In order to compute the probability, one has to adopt one of the previously 
stated scenarios. In practice, one has to form a grid of points that cover 
densely enough the region [0, 1] x [0, 1] in the graph. Use a varying intensity 
level scale to color the corresponding (a, (3) point. Black corresponds to 



probability one and red to probability zero. Fig. [L6j illustrates the type 
of graph that is expected to be recovered in practice, for large values of /. 
That is, the transition from the region (phase) of "success" (black) to that of 
"fail" (red) is very sharp. As a matter of fact, there is a curve that separates 
the two regions. The theoretical aspects of this curve have been studied in 



the context of combinatorial geometry in |Dono 05 for the asymptotic case, 
I — > oo, and in [Dono 10a] for finite values of I. Observe that the larger the 
value of a (larger percentage of measurements) the larger the value of (3 at 
which the transition occurs. This is in line with what we have said so far in 
this paper, and the problem gets increasingly harder as one moves up and 
to the left in the graph. In practice, for smaller values of I, the transition 
region from white to black is smoother, and it gets narrower as / increases. 
In practice, one can draw an approximate curve that separates the "success" 
and "fail" regions, using regression techniques, see, e.g., [Male 10 . 



The reader may already be aware of the fact that, so far, we have avoided 
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to talk about the performance of individual algorithms. We have just dis- 
cussed some "typical" behavior that algorithms tend to exhibit in practice. 
What the reader might have expected is to discuss comparative performance 
tests and draw related conclusions. We have not done it since we feel that 
it is early in time to have "definite" performance conclusions, and this field 
is still in an early stage. Most authors compare their newly suggested al- 
gorithm with a few other algorithms, usually within a certain algorithmic 
family and, more important, under some specific scenarios, where the ad- 
vantages of the newly suggested algorithm are documented. However, the 
performance of an algorithm can change significantly by changing the ex- 
perimental scenario, under which the tests are carried out. The most com- 
prehensive comparative performance study, so far, has been carried out in 



Male 10 . However, even in this one, the scenario of exact measurements 
has been considered and there are no experiments concerning the robustness 
of individual algorithms to the presence of noise. It is important to say that 
this study involved a huge effort of computation. We will comment on some 
of the findings from this study, which will also reveal to the reader that 
different experimental scenarios can significantly affect the performance of 
an algorithm. 

Fig. 17 1 shows the obtained phase transition curves for a) the iterative 
hard thresholding (IHT), b) the iterative soft thresholding scheme of (48) 
(1ST), c) the Two-Stage-Thresholding scheme (TST), as discussed earlier 
on, d) the LARS and e) the OMP algorithms, together with the theoreti- 
cally obtained one using £i minimization. All algorithms were tuned with 
the optimal values, with respect to the required user-defined parameters, 
after extensive experimentation. The results in the Figure correspond to 
the uniform spherical scenario, for the generation of the sensing matrices. 
Sparse vectors were generated according to the ±1 scenario, for the nonzero 
coefficients. The interesting observation is that, although the curves deviate 
from each other as we move to larger values of f3, for smaller values, the 
differences in their performance become less and less. This is also true for 
computationally simple schemes, such as the IHT one. The performance 
of LARS is close to the optimal one. However, this comes at the cost of 
computational increase. The required computational time for achieving the 
same accuracy, as reported in [Male 10] , favor the TST algorithm. In some 
cases, LARS required excessively longer time to reach the same accuracy, 
in particular when the sensing matrix was the partial Fourier one and fast 
schemes to perform matrix vector products can be exploited. For such ma- 
trices, the thresholding schemes (IHT, 1ST, TST) exhibited a performance 
that scales very well to large size problems. 

Fig. [17) 3 indicates the phase transition curve for one of the algorithms 
(1ST) as we change the scenarios for generating the sparse (target) vectors, 
using different distributions; a) ±1, with equiprobable selection of signs 
(Constant Amplitude Random Selection (CARS)), b) double exponential 
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Figure 17: (a) The obtained phase transition curves for different algorithms 
under the same experimental scenario, together with the theoretical one. (b) 
Phase transition curve for the 1ST algorithm under different experimental 
scenarios for generating the target sparse vector, (c) The phase transition 
for the 1ST algorithms under different experimental scenarios for generating 
the sensing matrix X. 
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(power), c) Cauchy and d) uniform in [—1,1]. This is indicative and typical 
for other algorithms as well, with some of them being more sensitive than 



others. Finally, Fig. 17 z shows the transition curves for the 1ST algorithm 
by changing the sensing matrix generation scenario. Three curves are shown 
corresponding to a) uniform spherical ensemble (USE), b) random sign en- 
semble (RSE), where the elements are ±1 with signs uniformly distributed 
and c) the uniform random projection (URP) ensemble. Once more, one can 
observe the possible variations that are expected due to the use of different 
matrix ensembles. Moreover, changing ensembles affects each algorithm in 
a different way. 

Concluding this section it must be emphasized that the field of algorith- 
mic development is still an ongoing research field and it is early to come with 
definite and concrete comparative performance conclusions. Moreover, be- 
sides the algorithmic front, existing theories often fall short to predict what 
is observed in practice, with respect to their phase transition performance. 



For a related discussion, see, e.g., Dono 10b 



Example 5. We are given a set of N = 20 measurements stacked in the 
y G 1Z N vector. These were taken by applying a sensing matrix X on an 
"unknown" vector in 1Z 50 , which is known to be sparse with k = 5 nonzero 
components; the location of these nonzero components in the unknown vec- 
tor are not known. The sensing matrix was a random matrix with elements 
drawn from a normal distribution J\f(Q, 1) and then the columns were nor- 
malized to unit norm. There are two scenarios for the measurements. In 
the first one, we are given the exact measurements while in the second one 
white Gaussian noise of variance a 1 = 0.025 was added. 

In order to recover the unknown sparse vector, the CoSaMP algorithm 
was used for both scenarios. 



The results are shown in Figs. 18 1 and [18p for the noiseless and noisy 



scenarios, respectively. The values of the true unknown vector 6 are rep- 
resented with black stems topped with open circles. Note that all but five 



of them are zero. In Fig. 18 1 exact recovery of the unknown values is suc- 
ceeded; the estimated values of 0j, % = 1, 2 . . . , 50, are indicated with squares 
in red color. In the noisy case of Fig. [T8p , the resulted estimates, which are 
denoted with squares, deviate from the correct values. Note that estimated 
values very close to zero (\9\ < 0.01) have been omitted from the figure in 
order to facilitate visualizing. In both figures, the stemmed gray filled circles 
correspond to the minimum £2 norm LS solution. The advantages of adopt- 
ing a sparsity-promoting approach to recover the solution are obvious. The 
CoSaMP algorithm was provided with the exact number of sparsity. The 
reader is advised to play with this example by experimenting with different 
values of the parameters and see how results are affected. 
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Figure 18: (a) Noiseless case. The values of the true vector, which generated 
the data for the Example [5j are shown with stems topped with open circles. 
The recovered points, using the CoSaMP, are shown with squares. An exact 
recovery of the signal has been obtained. The stems topped with gray circles 
correspond to the minimum Euclidean norm LS solution, (b) This figure 
corresponds to the noisy counterpart of that in (a) In the presence of noise, 
exact recovery is not possible and the more the power of the noise is the less 
accurate the results are. 
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12 Variations on the S pars ity- Aware Theme 

In our tour, so far, we have touched a number of aspects of the sparsity-aware 
learning that come from the main stream of the theoretical developments. 
However, more and more variants appear, which are developed with the 
goal to address problems of a more special structure and/or to propose 
alternatives, which can be beneficial in boosting the performance in practice, 
by serving the needs of specific applications. These variants focus either on 



the regularization term in ( 13 ) or on the misfit-measuring term or on both. 



Once more, research activity in this direction is dense and our purpose is 
to simply highlight possible alternatives and make the reader alert of the 
various possibilities that spring from the basic theory. 

In a number of tasks, it is a-priori known that the nonzero coefficients in 
the target signal/vector occur in groups and they are not randomly spread 
in all possible positions. Such a typical example is the echo path in internet 
telephony, where the nonzero coefficients of the impulse response tend to 
cluster together, see Fig. [6j Other examples of "structured" sparsity can be 
traced in DNA microarrays, MIMO channel equalization, source localization 
in sensor networks, magnetoencephalography or in neuroscience problems, 
e.g., Bara lOb Parv 08,Garr 08 , Bara 10a| . As it is always the case in 



Machine Learning, being able to incorporate a-priori information in the 
optimization can only be of benefit for improving performance, since the 
estimation task is externally assisted in its effort to search for the target 
solution. 

The group LASSO, |Baki 99[[Yuan 07||Oboz 06||Frie 10] addresses the task 
where it is a-priori known that the nonzero components occur in groups. The 
unknown vector 9 is divided into, say, L groups, i.e., 



TiT 



e T = [eJ,...,e 

each of them of a predetermined size, Sj, i = 
The regression model can then be written as 



1,2,..., L, with ^2 i=1 Si = I. 



where each Xj is a submatrix of X comprising the corresponding Sj columns. 
The solution of the group LASSO is given by the following LS regularized 
task 



6 = arg min 0g7 ^; 



y 



i=l 



''Ml 2 I ' 



(59) 



8=1 



where ||0i|| 2 is the Euclidean norm (not the squared one) of 0i, i.e., 
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Figure 19: (a) The isovalue curves for the l\ and the weighted l\ norms for 
the same value. The weighted l\ is sharply pinched around one of the axis, 
depending on the weights, (b) Adopting to minimize the weighted l\ norm, 



for the setup of Figure 10 the correct sparse solution is obtained. 



In other words, the individual components of 6 that contribute to the forma- 
tion of the l\ norm, in the standard LASSO formulation are now replaced by 
the square root of the energy of each individual block. In this setting, it is 
not the individual components but blocks of them which are forced to zero, 
when their contribution to the LS misfit measuring term is not significant. 
Sometimes, this type of regularization is coined as the regularization. 
It is straightforward to see that if L = I, then the group LASSO becomes the 
standard LASSO method. An alternative formulation of the group sparse 
model using greedy algorithms is considered in [Elda 10] . Theoretical results 
that extend the RIP to the so-called block RIP have been developed and 
reported, see, e.g., 



Blum 09b Lu 08a 



In Cevh 09 Bara 10b 



the so-called model based Compressed Sensing 
is addressed. The (k, C) model allows the significant coefficients of a k- 
sparse signal to appear in at most C clusters, whose size is unknown. This 
is a major difference with the group LASSO, that was reported before. In 
Section 10 it was commented that searching for a /c-sparse solution takes 
place in a union of subspaces, each one of dimensionality k. Imposing a 
certain structure on the target solution restricts the searching in a subset of 
these subspaces and leaves a number of these out of the game. This obviously 
facilitates the optimization task. In [Cevh 09 , a dynamic programming 



technique is adopted to obtain the solution. In Cevh 10 , structured sparsity 
is considered in terms of graphical models. An even more advanced block 
sparsity model is the C-HiLasso, which allows each block to have a sparse 
structure itself, Spre 111. 



In |Cand 08~b , it is suggested to replace the l\ norm by a weighted 
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version of it. To justify such a choice, let us recall Example [2] and the case 
where the "unknown" system was sensed using x = [2, if. We have seen 
that by "blowing" up the l\ ball, the wrong sparse solution was obtained. 



Let us now replace the l\ norm in (28) with its weighted version 



:= ^ll^ll +W2\02\, Wl,W2>0, 



19a shows the isovalue curve = 1, 



and set u>i = 4 and w\ = 1. Fig. 
together with that resulting from the standard l\ norm. The weighted one is 
sharply "pinched" around the vertical axis, and the larger the value of w\ is, 
compared to that of u>2, the sharper the corresponding ball will be. Fig.|19b 
shows what happens when "blowing" the weighted t\ ball. It will first touch 
the point (0, 1), which is the true solution. Basically, what we have done 
is to "squeeze" the t\ ball to be aligned more to the axis that contains the 
(sparse) solution. For the case of our example, any weight w\ > 2 would do 
the job. 

Considering now the general case of a weighted norm 

I 

W e Wl,w '=J2 w i\ e j\> wj>0. 
The ideal choice of the weights would be 

Ooj + 0, 

where Oq is the target true vector, and where we have silently assumed that 
0-oo = 0. In other words, the smaller a coefficient is the larger the respective 
weight becomes. This is justified, since large weighting will force respective 
coefficients towards zero during the minimization process. Of course, in 
practice the values of the true vector are not known, so it is suggested to 
use their estimates during each iteration of the minimization procedure. The 
resulting scheme is of the following form. 

Algorithm 3. 

1. Initialize weights to unity, = 1, j = 1,2,...,Z. 

2. Minimize the weighted l\ norm, 

W = argmin ee7? .i ||0|| 1)W 
s.t. y = xe. 




3. Update the weights 
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Figure 20: One-dimensional graphs of the l\ norm and the logarithmic reg- 
ularizer ln(^ + l) = m (l^l + e ) ~~ l ne ) with e = 0.1. The term lne was 
subtracted for illustration purposes only and does not affect the optimiza- 
tion. Notice the nonconvex nature of the logarithmic regularizer. 



4. Terminate when a stopping criterion is met, otherwise return to step 
2. 

The constant e is a small user-defined parameter to guarantee stability 
when the estimates of the coefficients take very small values. Note that if the 
weights have constant preselected values, the task retains its convex nature; 
this is no longer true when the weights are changing. It is interesting to 
point out that this intuitively motivated weighting scheme can result if the 



l\ norm is replaced by X)j=i m (Ifyl + e ) as the regularizing term of (13) 



Fig. 20 shows the respective graph, in the one-dimensional space together 
with that of the t\ norm. The graph of the logarithmic function reminds 
us of the £ p , p < < 1, "norms" and the comments made in Section [3j 
This is no more a convex function and the iterative scheme, given before, 
is the result of a majorization-minimization procedure in order to solve the 



resulting non-convex task, Cand 08b 



The concept of the iterative weighting, as used before, has also been 
applied in the context of the iterative reweighted least squares algorithm. 
Observe that the l\ norm can be written as 

l 

3=1 



where 



|0i I 






w\ 







m. 
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and where in the case of B% = 0, for some i E {1,2,...,/}, the respective 
coefficient of Wg is defined to be 1 . If Wg were a constant weighting matrix, 
i.e., Wg := Wq, for some fixed 6, then obtaining the minimum 

6 = argmin fle7e ' \\y - XG\\ 2 2 + \6 T W § 0, 

is straightforward and similar to the ridge regression. In the iterative reweighted 
scheme, Wg is replaced by Wq{%) , formed by using the respected estimates of 
the coefficients which have been obtained from the previous iteration, i.e., 
6 := 0W, as we did before. In the sequel, each iteration solves a weighted 
ridge regression task. Variants of this basic iteratively weighting scheme 



have also been proposed, see, e.g., [Daub 10 and the references therein. 



In |Cand 07 , the LASSO task is modified by replacing the square error 



term with one involving correlations and the minimization task becomes 

: min 1101 

oen 1 



s.t. WxTfr-XO^Ke, 



where e is related to I and the noise variance. This task is known as the 
Dantzig selector. That is, instead of constraining the energy of the error, the 
constraint, now, imposes an upper limit to the correlation of the error vector 
with any of the columns of X. In |Bick 09 Asif 10b , it is shown that under 



certain conditions the LASSO estimator and the Dantzig selector become 
identical. 

Total Variation (TV) |Rudi 92\ is a closely related to i\ sparsity promo- 
tion notion and it has been widely used in image processing. Most of the 
grayscale image arrays, I G lZ lxl , consist of slowly varying pixel intensities 
except at the edges. As a consequence, the discrete gradient of an image 
array will be approximately sparse (compressible). The discrete directional 
derivatives of an image array are defined pixel- wise as 

V x (I)(i,j) := I(i + lJ)-I(i,j), V»e{l,2,...,Z-l}, (60) 
V y (I)(i,j) := I(i,j + 1)-I(i,j), VjG {1,2,...,/- 1}, (61) 

and 

V x (WJ)--=V y (I)(i,l):=Q, Vi,je{l,2,...,Z-l}. (62) 
The discrete gradient transform 

V : U lxl ^ TZ lx21 , 

is defined in terms of a matrix form as 

V(I)(i,j):=[V x (i,j),V y (i,j)}, Vi,i€{l,2,...Z}. (63) 
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The total variation of the image array is defined as the l\ norm of the 
magnitudes of the elements of the discrete gradient transform, i.e., 

ll 11 

Wtv : = E E \mwj)\\2 = E E + v,(/) 2 (*, j). 

i=l j'=l i=l j'=l 

(64) 

Note that this is a mixture of £2 and £1 norms. The sparsity promoting 
optimization around the total variation is defined as 

/* G argmin ||J|| TV 

s.t. ||y-^(/)|| 2 <€, (65) 

where y 6 72. is the measurements vector and J- (I) denotes the result in 
vectorized form of the application of a linear operator on /. For example, 
this could be the result of the action of a partial two-dimensional DFT on 
the image. Subsampling of the DFT matrix as a means to form sensing 



matrices has already been discussed in Section 8.2 The task in (65) retains 
its convex nature and it basically expresses our desire to reconstruct an 
image which is as smooth as possible given the available measurements. The 
NESTA algorithm can be used for solving the total variation minimization 
task; besides it, other efficient algorithms for this task can be found in, e.g., 



Gold 09, Yang 10 



It has been shown in [Cand 06a , for the exact measurements case (e = 0), 



and in |Need 12 , for the erroneous measurements case, that conditions and 



bounds which guarantee recovery of an image array from the task in (65) 
can be derived and are very similar with those that we have discussed for 
the case of the i\ norm. 

Example 6 (Magnetic Resonance Imaging (MRI)). In contrast to ordinary 
imaging systems, which directly acquire pixel samples, MRI scanners sense 
the image in an encoded form. Specifically, MRI scanners sample com- 
ponents in the spatial frequency domain, known as "fc-space" in the MRI 
nomenclature. If all the components in this transform domain were avail- 
able, one could apply the inverse 2D-DFT to recover the exact MR image 
in the pixel domain. Sampling in the A;-space is realized along particular 
trajectories in a number of successive acquisitions. This process is time 
consuming, merely due to physical constraints. As a result, techniques for 
efficient image recovery from a limited number of measurements is of high 
importance, since they can reduce the required acquisition time for perform- 
ing the measurements. Long acquisition times are not only inconvenient but 
even impossible, since the patients have to stay still for long time intervals. 
Thus, MRI was among the very first applications where compressed sensing 
found its way to offer its elegant solutions. 



Fig. 21 1 shows the "famous" Shepp-Logan phantom, and the goal is to 



recover it via a limited number of (measurements) samples in its frequency 
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(c) (d) 

Figure 21: a) The original Shepp-Logan image phantom, b) The white 
lines indicate the directions across which the sampling in the spatial Fourier 
transform were obtained, c) The recovered image after applying the inverse 
DFT having first filled with zeros the missing values in the DFT transform, 
d) The recovered image using the total variation minimization approach. 



domain. The MRI measurements are taken across 17 radial lines in the 



spatial frequency domain, as shown in Fig. 21 b). A "naive" approach to 
recover the image from this limited number of measuring samples would be 
to adopt a zero-filling rationale for the missing components. The recovered 
image according to this technique is shown in Fig. 21 c). Fig. 21 d) shows 
the recovered image using the approach of minimizing the total variation, 
as explained before. Observe that the results for this case are astonishingly 
good. The original image is almost perfectly recovered. The constrained 
minimization was performed via the NESTA algorithm. Note that if the 
minimization of the l\ norm of the image array were used in place of the 
total variation, the results would not be as good; the phantom image is 
sparse in the discrete gradient domain, since it contains large sections which 
share constant intensities. 
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13 Online Time- Adaptive Sparsity-Promoting Al- 
gorithms 

In this section, online (time-recursive) schemes for sparsity-aware learning 
are presented. There is a number of reasons that one has to resort to such 
schemes in various signal processing tasks, for example, when the data arrive 
sequentially. Under such a scenario, using batch processing techniques to 
obtain an estimate of an unknown target parameter vector would be highly 
inefficient, since the number of training points keeps increasing. Such an 
approach is prohibited for real time applications. Moreover, time-recursive 
schemes can easily incorporate the notion of adaptivity, when the learning 
environment is not stationary but it undergoes changes as time evolves. Be- 
sides signal processing applications, there is an increasing number of machine 
learning applications where online processing is of paramount importance, 
such as bioinformatics, hyperspectral imaging, and data mining. In such ap- 
plications, the number of training points easily amounts to a few thousand 
up to hundred of thousand points. Concerning the dimensionality of the 
ambient (feature) space, one can claim numbers that lie in similar ranges. 
For example, in [Lang 09] , the task is to search for sparse solutions in fea- 
ture spaces with dimensionality as high as 10 9 having access to data sets as 
large as 10 points. Using batch techniques, in a single computer, is out of 
question with today's technology. 

Let us assume that there is an unknown parameter vector that generates 
data according the standard regression model 

y n = x^6 + r] n , Vn, 

and the training samples are received sequentially (y n , x n ), n = 1, 2, . . .. In 
the case of a stationary environment, we would expect our algorithm to con- 
verge, asymptotically as n — > oo, to or "near to" the true parameter vector 
that gives birth to the measurements, y n , when it is sensed by x n . For time 
varying environments, the algorithms should be able to track the underly- 
ing changes as time goes by. Before we proceed, a comment is important. 
Since the time index, n, is left to grow, all we have said in the previous 
sections with respect to underdetermined systems of equations, looses its 
meaning. Sooner or later we are going to have more measurements than 
the dimensionality of the space. Our major concern here becomes the issue 
of asymptotic convergence, for the case of stationary environments. The 
obvious question, that is now raised, is why not using a standard algorithm, 



e.g., LMS, RLS or APSM Saye 03 Hayk 96 Theo 11 , since we know that 



these algorithms converge to, or near enough in some sense, to the solution; 
that is, the algorithm will identify the zeros asymptotically. The answer is 
that if such algorithms are modified to be aware for the underlying sparsity, 
convergence is significantly speeded up; in real life applications, one has not 
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the "luxury" to wait long time for the solution. In practice, a good algo- 
rithm should be able to provide a good enough solution, and in the case 
of sparse solutions to obtain the support, after a reasonably small number 
of iteration steps. In this section, the powerful theory around the l\ norm 
regularization will be used to obtain sparsity-aware time adaptive schemes. 



13.1 LASSO: Asymptotic Performance 



The notions of bias, variance and consistency, are major indices for assessing 
the performance of an estimator. In a number of cases, such performance 
measures are derived asymptotically. For example, it is well known that the 
maximum likelihood estimator is asymptotically unbiased and consistent 
Theo 09 . Also the LS estimator is asymptotically consistent. Moreover, 
under the assumption that the noise samples are i.i.d., the LS estimate, On, 
that is obtained using N measurement (training) samples, is itself a random 
vector, that satisfies the \/iV-estimation consistency, e.g., Kay 93 , i.e., 



N[0 N -0 ) 4 M (0, a 2 ^ 1 ) 



where 0$ is the true vector that generates the measurements, a 1 denotes 
the variance of the noise source and £ is the covariance matrix E[a;a; r ] of 
the input sequence, which has been assumed to be zero mean and the limit 
denotes convergence in distribution. 

The LASSO in (13) is the task of minimizing the t\ norm regularized 



version of the LS cost. However, nothing has been said, so far, about the sta- 
tistical properties of this estimator. The only performance measure that we 



referred to was the error norm bound given in (43). However, this bound, 



although important in the context it was proposed for, does not provide 
much statistical information. Since the introduction of the LASSO estima- 
tor, a number of papers have addressed problems related to its statistical 
performance, see, e.g., |Dono 95 Knig 00, Fan Ol Zou 06 



When dealing with sparsity-promoting estimators, such as the LASSO, 
two crucial issues emerge: a) whether the estimator, even asymptotically, 
can obtain the support, if the true vector parameter is a sparse one and 
b) quantify the performance of the estimator with respect to the estimates 
of the nonzero coefficients, i.e., those whose index belongs to the support. 
Especially for LASSO, the latter issue becomes to study whether LASSO 
behaves as well as the unregularized LS with respect to these nonzero com- 
ponents. This task was addressed, for a first time and in a more general 
setting, in |Fan 01 . Let the support of the true, yet unknown, fc-sparse 



parameter vector 6q be denoted as S. Let also £15 be the k x k covari- 
ance matrix Efx^a;^], where x\g G lZ k is the vector that contains only the 
k components of x, with indices in the support S. Then, we say that an 
estimator satisfies asymptotically the oracle properties if: 
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• lim7v-s>oo Prob ^S§ N = s\ = 1. This is known as support consistency. 

• \/~N (On\S ~ Oo\Sj N (o, fj2 ^'|s 1 ) • This is the \fN - estimation con- 
sistency. 

We denote as #o|S an d 6n\s the /c-dimensional vectors which result from 
#o> Qn-i respectively, if we keep the components whose indices lie in the sup- 
port S. In other words, according to the oracle properties, a good sparsity- 
promoting estimator should be able a) to predict, asymptotically, the true 
support and b) its performance with respect to the nonzero components 
should be as good as that of a genie-aided LS estimator, which is informed, 
in advance, of the positions of the nonzero coefficients. 

Unfortunately, the LASSO estimator cannot satisfy simultaneously both 



conditions. It has been shown, Knig 00, Fan 01,Zou 06] that: 



For support consistency, the regularization parameter A := Atv should 
be time varying such as 

lim ^— = oo, lim — — = 0. 

N^oo J]S[ iY-s-oo N 



That is, Xn must grow faster than yN, but slower than N. 
For "v/iV-consistency, A at must grow as 

lim = 0, 



i.e., it grows slower than y/N. 

The previous two conditions are conflicting and the LASSO estimator 
cannot comply with the two oracle conditions simultaneously. The proofs 
of the previous two points are somewhat technical and are not given here. 
The interested reader can obtain them from the previously given references. 
However, before we proceed, it is instructive to see why the regularization 
parameter has to grow slower than N, in any case. Without being too rig- 
orous mathematically, recall that the LASSO solution comes from equation 



(13). This can be written as 

0£ ~ Yx nVn + ^- I > \r„xl \ 0+ ^-r;||0||j ■ (00) 




where we have divided by iV both sides. Taking the limit as N — > oo, if 
\n/N —¥ 0, then we are left with the first two terms; this is exactly what 
we would have if the unregularized LS had been chosen as the cost function. 
In this case, the solution asymptotically converges^] (under some general 
assumptions, which are assumed to hold true, here) to the true parameter 
vector; that is, we have strong consistency, e.g., |Kay 93 . 



Recall that this convergence is with probability 1. 
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13.2 The Adaptive Norm- Weighted LASSO 

There are two ways to get out of the previously stated conflict. One is 
to replace the l\ norm with a nonconvex function and this can lead to an 
estimator that satisfies the oracle properties simultaneously |Fan 01| . The 
other is to modify the l\ norm by replacing it with a weighted version. Recall 



that the weighted i\ norm was discussed in Section 12, as a means to assist 
the optimization procedure to unveil the sparse solution. Here the notion of 
weighted i\ norm comes as a necessity imposed by our willingness to satisfy 
the oracle properties. This gives rise to the adaptive norm-weighted LASSO 
cost estimator defined aa^ 

= arg mm eenl J £ ^ [ Uj - x jo) 2 + X n £ w % (n) 1 9 { | 1 , (67) 
[j=i i=i J 

where (3 < 1 is used as the forgetting factor to allow for tracking slow 
variations. The time varying weighting sequences is denoted as Wi(n). There 
are different options. In |Zou 06] and under a stationary environment with 
[3 = 1, it is shown that if 

where 6>f st is the estimate of the ith component obtained by any -y/n-consistent 
estimator, such as the unregularized LS, then for specific choices of A n and 
7 the estimator satisfies the oracle properties simultaneously. The main 
reasoning behind the weighted norms is that as time goes by, and the yjn- 
consistent estimator provides better and better estimates, then the weights 
corresponding to indices outside the true support (zero values) are inflated 
and those corresponding to the true support converge to a finite value. This 
helps the algorithm, simultaneously, to locate the support and obtain unbi- 
ased (asymptotically) estimates of the large coefficients. 

Another choice for the weighing sequence is related to the so called 



Smoothly Clipped Absolute Deviation (SCAD) [Fan 01 Zou 08 . This is de- 
fined as 

(a fi n - |0, cst |L 
m(n) = X(o^)(|flF*l) + (a _i)pJ - XOwxOdflri), 

where x(') stands for the characteristic function, /x n = X n /n, and a > 2. 
Basically, this corresponds to a quadratic spline function. It turns out, 



Zou 08 , that if A ra is chosen to grow faster that \/n and slower that n, the 



adaptive LASSO, with fj = 1 satisfies both oracle conditions, simultaneously. 



5 To emphasize that the number of training points is now increasing, we have used n in 
place of N. Capital N was previously used to denote a fixed number of points. 



13.3 Adaptive CoSaMP Algorithm (AdCoSaMP) 



77 



A time adaptive scheme for solving the time adaptive norm-weighted 



(TNWL) LASSO was presented in Ange 10 . The cost function of the 



adaptive LASSO in (67) can be written as 

j{e) = e T R n o-rlo + \ n \\e\\ 1 

where 



w(n) 



Rn := 

3=1 



3^j 



3=1 



and ||0||i w ( n ) is the weighted t\ norm. It is straightforward to see, that 



Rn 



PRfl—1 ^n^ni f"n — (3T n —l -\- y n X n . 



The complexity for both of the previous updates, for matrices of a general 
structure, amounts to 0(l 2 ) multiply/add operations. One alternative is to 
update R n and r n and then solve a convex optimization task for each time 
instant, n, using any standard algorithm. However, this is not appropri- 
ate for real time applications, due to its excessive computational cost. In 
Ange 10 , a time recursive version of a coordinate descent algorithm has 



been developed. As we have seen in Section 11.2 coordinate descent algo- 



rithms update one component at each recursive step. In Ange 101, recursive 



steps are associated with time updates, as it is always the case with the time- 
recursive algorithms. As each new training pair (y n , x n ) is received, a single 
component of the unknown vector is updated. Hence, at each time instant, 
a scalar optimization task has to be solved and its solution is given in closed 
form, which results in a simple soft thresholding operation (OCCD-TWL). 
If the weighted norm is to be used in place of the l\ , a RLS is run in parallel 
to provide the necessary weights. One of the drawbacks of the coordinate 
techniques is that each coefficient is updated every I time instants, which, for 
large values of I, can slow down convergence. Variants of the basic scheme 
that cope with this drawback are also addressed in [Ange 10" , referred to 
as online cyclic coordinate descent (OCCD-TNWL). The complexity of the 
scheme is of the order of 0(l 2 ). Computational savings are possible, if the 
input sequence is a time series and fast schemes for the updates of R n and 
the RLS can then be exploited. However, if an RLS-type algorithm is used 
in parallel, the convergence of the overall scheme may be slowed down, since 
the RLS-type algorithm has to converge first, in order to provide reliable 
estimates for the weights, as pointed out before. 

13.3 Adaptive CoSaMP Algorithm (AdCoSaMP) 



In Mile 10 , an adaptive version of the CoSaMP algorithm, which was pre- 
sented in Section 11.1.3, was proposed. Iteration steps, i, now coincide with 



time updates, n, and the LS solver in Step [3c] of the general CSMP scheme 
is replaced by an LMS one. 
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3 a 



of the CSMP 



Let us focus first on the quantity X T e^~ 1 ^ in Step 
scheme, which is used to compute the support at iteration i. In the adaptive 
setting and at (iteration) time n, this quantity is now "rephrased" as 

n-l 

X T e{n-l) = Y J *Aj)- 
J'=l 

In order to make the algorithm flexible to adapt to variations of the envi- 
ronment, as the time index, n, increases, the previous correlation sum is 
modified to 

n-l 

p(n) ■- IP-i-'xjeV) = Pp(n - 1) + x n . x e(n - 1). 
i=i 

The LS task, constrained on the active columns that correspond to the 
indices in the support S in Step 3c, is performed in an adaptive rationale 
by involving the basic LMS recursions, i.e., 



6\ s (n) := 0\ S (n-i)+nx n \ S e(n) 



e(n) := y n - x n]s O\ s (n - 1) 



where 0\s(-) and x n \$ denote the respective subvectors corresponding to the 
indices in the support S. The resulting algorithm is given as follows. 

Algorithm 4 (The AdCoSaMP Scheme). 

1. Select the value of t = 2k. 

2. Initialize the algorithm: 6(1) = 0, 0(1) = 0, p(l) = 0, e(l) = y\. 

3. Choose n and f3. 

4. For n = 2, 3, . . ., execute the following steps. 

(a) p(n) = (3p(n - 1) + x n -±e(n - 1). 



(b) Obtain the current support: 

5' = supp{0(n-l)}U 

(c) Perform the LMS update: 



indices of the t largest 
in magnitude components of p(n) 



e(n) = y n -xl ls 6 ls (n-l), 
0\ s (n) = 9\ s (n - 1) + /j,x n \ s e{n). 

(d) Obtain the set Sk of the indices of the k largest components of 
9\ S (n). 
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(e) Obtain 0(n) such that: 

d \s k (") = 0\s k , and 0\ S c (n) = 0, 

where S£ is the complement set of 

(f) Update the error: e(n) = y n — x^6(n). 

In place of the standard LMS, its normalized version can alternatively be 
adopted. Note that Step |4e] is directly related to the hard thresholding 
operation. 



In |Mile 10 , it is shown that if the sensing matrix, which is now time de- 



pendent and keeps increasing in size, satisfies a condition similar to RIP, for 
each time instant, called Exponentially Weighted Isometry Property (ERIP), 
which depends on /3, then the algorithm asymptotically satisfies an error 
bound, which is similar to the one that has been derived for CoSaMP in 
|Need 09] , plus an extra term that is due to the excess Mean Square Error, 
which is the price paid by replacing the LS solver by the LMS. 

13.4 Sparse Adaptive Parallel Projection onto Convex Sets 
Method (SpAPSM) 

The APSM family of algorithms is one among the most powerful techniques 



for adaptive learning |Theo 11 . A major advantage of this algorithmic fam- 
ily is that one can readily incorporate convex constraints. The rationale 
behind APSM is that since our data are known to be generated by a re- 
gression model, then the unknown vector could be estimated by finding a 
point in the intersection of a sequence of hyperslabs, that are defined by the 
data points, i.e., S n [e] := {6 S 1Z 1 : \y n — x^6\ < e}. Such a model is most 
natural when the noise is bounded, (which, after all, it is the case in any 
practical application). In case the noise is assumed unbounded, a choice of e 
of the order say, a, can guarantee, with high probability, that the unknown 
solution lies inside these hyperslabs. 

The APSM family builds upon the elegant philosophy that runs across 
the classical projections onto convex sets (POCS) theory. Recall that the 
basic rationale behind POCS is that starting from an arbitrary point in the 
space and sequentially projecting onto a finite number of convex sets then 
the sequence of projections converges, in some sense, into the intersection 
of all these sets, assuming this is not empty. The theory was extended 
to embrace the online processing setting in [Yama 04" Ogur 02 Slav 06 



In contrast to the classical POCS theory, here the number of the involved 
convex sets is infinite. It turns out that, under certain general conditions, a 
sequence of projections over all these sets also converges to a point in their 
intersection. 

To fit the theory into our needs, the place of the aforementioned convex 
sets is taken by the hyperslabs, which are formed by the received training 
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data, as mentioned before. Thus, the resulting algorithms involves (metric) 
projections onto these hyperslabs (see Appendix). However, when dealing 
with sparse vectors, there is an extra projetion associated with the convex set 
formed by the l\ ball; that is, H^l^ < p (see, also, the LASSO formulation 



(14)). Hence, this task fits nicely in the APSM rationale and the basic 



recursion can be readily written, without much thought or derivation, as 
follows; for any arbitrarily chosen initial point 0(0), define Vre, 

e{n):=P B Je{n-l)+p n (- J2 P Si[e](0(ri-l))-e(n-l) \ J 

where Psm is the metric projection onto the hyperslab Si[e] (see Appendix). 
Note, that in the previous recursion we have used q, instead of one, hyper- 
slabs whose metric projections are averaged out at time re. It turns out that 
such an averaging improves convergence significantly. Parameter p n is an 
extrapolation parameter, which takes values in the interval (0, 2j\A n ), where 

» 



E^„- g+ i^ni^ <[e ](Q("-i))-o("-i)i 

||EL„- 9+1 ^ (n) P SjE] (0(n-l))-0(n-l)| 



if ^2i=n-q+l 

1, otherwise, 



co- 



in) 



P 5i[e] (0( n -l))-0(re-l) 



(68) 



and P B( [pj(-) is the projection operator onto the l\ ball -B^lp] := {0 6 : 
II Hi < p}-, since the solution is constrained to live within this ball. Note, 
that the previous recursion is analogous to the iterative soft thresholding 



shrinkage algorithm in the batch processing case, (53). There, we saw that 



the only difference that the sparsity imposes on an iteration, with respect 
to its unconstrained counterpart, is an extra soft thresholding. This is ex- 
actly the case here. The term in the parenthesis is the iteration for the 
unconstrained task, 
tion on the 



Moreover, as it has been shown in Duch 08 



projec- 



l\ ball is equivalent to a soft thresholding operation. It can be 
shown that the previous iteration converges arbitrarily close to a point in 
the intersection 

B tl [S\n P| S n [e], 

n>no 



for some finite value of n [Yama 04|Ogur 02|Slav 06||Theo 11| . In [Kops 11a 

Kops lib the weighted i\ ball has been used to improve convergence as 



well as the tracking speed of the algorithm, when the environment is time 
varying. The weights were adopted in accordance to what was discussed in 



Section 12 i.e. 
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where (e n )n>Q is a sequence (can be also constant) of small numbers to 
avoid division by zero. The basic time iteration becomes as follows; for any 
arbitrarily chosen initial point 0(0), define Vn, 



0(n) := 



P B^ [w(n),p] 



9{n- 



W- fa 



[e] (0(n-l))-0(n- 



i=n—q+l 




where //„ G (0, 2A4 n ) and n is given in (|68l). Fig. |22| illustrates the asso- 
ciated geometry of the basic iteration in l£ 2 and for the case of q = 2. It 
comprises two parallel projections on the hyperslabs followed by one projec- 
tion onto the weighted l\ ball. In Kops 11a] , it is shown that a good bound 
for the weighted l\ norm is the sparsity level k of the target vector, which 



is assumed to be known and it is a user-defined parameter. In Kops 11a 



it is shown that asymptotically, and under some general assumptions, this 
algorithmic scheme converges arbitrarily close to the intersection of the hy- 
perslabs with the weighted l\ balls, i.e., 



n 

n>no 



P B tl [w{n),p\ n Sj[e\ 



for some non-negative integer no- It has to be pointed out that, in the case 
of weighted l\ norms, the constraint is time varying and the convergence 
analysis is not covered by the standard analysis used for APSM, and had 
to be extended to this more general case. The complexity of the algorithm 
amounts to 0(ql). The larger the q the faster the convergence rate, at the 
expense of higher complexity. In Kops lib , in order to reduce the depen- 



dence of the complexity on q, the notion of the sub -dimensional projection 
is introduced, where projections onto the q hyperslabs could be restricted 
along the directions of the most significant coefficients, of the currently avail- 
able estimates. The dependence on q now becomes 0(qk n ) where k n is the 
sparsity level of the currently available estimate, which, after a few steps 
of the algorithm, gets much lower than I. The total complexity amounts to 
0(1) + 0(qk n ), per iteration step. This allows the use of large values of q, 
which drives the algorithm to a performance close to that of the adaptive 
weighted LASSO, at only a small extra computational cost. 



13.4.1 Projection onto the Weighted l\ Ball 

Projecting onto an l\ ball is equivalent to a soft thresholding operation. 
Projection onto the weighted l\ norm results to a slight variation of the soft 
thresholding, with different threshold values per component. In the sequel, 
we give the iteration steps for the more general case of the weighted l\ ball. 
The proof is a bit technical and lengthy and it will not be given here. It was 
derived, for the first time, via purely geometric arguments, and without the 
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Figure 22: Geometric illustration of the update steps involved in the 
SpAPSM algorithm, for the case of q = 2. The update at time n + 1 is 
obtained by first convexly combining the projections onto the current and 
previously formed hyperslabs, S n [e], 5 n _i[e] and then projecting onto the 
weighted i\ ball. This brings the update closer to the target solution 0*. 



use of the classical Lagrange multipliers, in [Kops 11a] , Lagrange multipliers 



have been used instead in Duch 08 , for the case of the l\ ball. 

Given a point outside the ball, 6 G 1Z 1 \ B^[w,p], then its projection 
onto the weighted l\ ball is the point Pg \w,p\ W £ Bi 1 [w,p] := {z € 1Z 1 : 

Yli=i w i\ z i\ — p}i that lies closest to 6 in the Euclidean sense. If 6 lies 
within the ball then it coincides with its projection. Given the weights and 
the value of p, the following iterations provide the projection. 

Algorithm 5 (Projection onto the weighted l\ ball Bfjto, p]). 

1. Form the vector . . . , \9i\/wi] T £ 1Z 1 . 

2. Sort the previous vector in a non-ascending order, so that |0 T (i)l A°t(i) ^ 
• •• > \G t (1)\I w t(1)- The notation r stands for the permutation, which 
is implicitly defined by the sorting operation. Keep in memory the 
inverse r _1 , which is the index of the position of the element in the 
original vector. 

3. n := I. 

4. Let m = 1. While m < I, do: 



(a) m* := m. 
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(b) Find the maximum among those j S {1, 2, . . . , r m } such that 

(c) If j* = r m then break the loop. 

(d) Otherwise set r m+ \ := j*. 

(e) Increase m by 1 and go back to Step 4a 



5. Form the vector p £ 7£ rm * whose j-th component, j = 1, r m _, is 
given by 

m | Ei=i w T{i)\9 T {i)\ -P 

Pj - Fr(i)l V r mt 2 W T (j)- 

2-ui=\ W r(i) 

6. Use the inverse mapping r _1 to insert the element pj into the r _1 (j) 
position of the /-dimensional vector p, Vj S {1,2,... r m „}, and fill in 
the rest with zeros. 

7. The desired projection is PB h [w, P ](0) = [sgn(0i)pi, . . . , sgn(6»;)^] T . 

Remarks 12. Projections onto both l\ and weighted t\ balls impose convex 
sparsity inducing constraints via properly performed soft thresholding op- 



erations. More recent advances within the SpAPSM framework Kops 12b] , 
allow the substitution of Pg e y and Pg t r.„,y with a generalized thresh- 
olding, built around the notions of SCAD, nonegative garrote, as well as a 
number of thresholding functions corresponding to the non-convex, £ p , p < 1 
penalties. Moreover, it is shown shown that such generalized thresholding 
operators (GT) are nonlinear mappings with their fixed point set being a 
union of subspaces, i.e., the non-convex object which lies at the heart of any 
sparsity-promoting technique. Such schemes are very useful for low values of 
q, where one can improve upon the performance obtained by the LMS-based 
AdCoSAMP, at comparable complexity levels. 

Example 7. ( Time varying signal) In this example, the performance curves 
of the most typical online algorithms, mentioned before, are studied in the 
context of a time varying environment. A typical simulation setup, which is 
commonly adopted by the adaptive filtering community, in order to study 
the tracking agility of an algorithm, is that of an unknown vector which 
undergoes an abrupt change after a number of observations. Here, we con- 
sider a signal, s, with a sparse wavelet representation, i.e., s = where $ 
is the corresponding transformation matrix. In particular, we set I = 1024 
with 100 nonzero wavelet coefficients. After 1500 measurements (obser- 
vations), ten arbitrarily picked wavelet coefficients change their values to 
new ones selected uniformly at random from the interval [—1 1]. Note 
that this may affect the sparsity level of the signal, and we can now end 
with up to 110 nonzero coefficients. A total of iV = 3000 sensing vec- 
tors are used, which result from the wavelet transform of the input vectors 
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-0 — Sparse APSM-Weighted d 
#- - OCCD-TWL, 8 = 0.996 
e — AdCoSaMP, y, = 0.3, /3 = 0.999 




1500 2000 
Iterations 



3000 



Figure 23: MSE learning curves for AdCoSAMP, SpAPSM and OCCD-TWL 
for the simulation example discussed in[7| The vertical axis shows the log 10 
of the Mean Squares Error, i.e., log 10 (j^ \\s — $0(n)||2^ and the horizontal 
shows the time index. At time n = 1500, the system undergoes a sudden 
change. 



x n G 1Z , n = 1, 2, . . . , 3000, having elements drawn from A/"(0, 1). In this 
way, the adaptive algorithms do not estimate the signal itself, but its sparse 
wavelet representation, 9. The observations are corrupted by additive white 
Gaussian noise of variance = 0.1. Regarding SpAPSM, the extrapolation 
parameter /j> n is set equal to 1.8 x Ai n , the hyperslabs parameter e was set 
equal to 1.3cr n and q = 390. The parameters for all algorithms were selected 
in order to optimize their performance. Since the sparsity level of the signal 
may change (from k = 100 up to k = 110) and since in practice it is not 
possible to know in advance the exact value of k, we feed the algorithms 
with an overestimate, k, of the true sparsity value and in particular we used 
k = 150 (i.e., 50% overestimation up to the 1500-th iteration). 



The results are shown in Fig. 23 Note the enhanced performance ob- 



tained via the SpAPSM algorithm. However, it has to be pointed out that 
the complexity of the AdCoSAMP is much lower compared to the other 
two algorithms, for the choice of q = 390 for the SpAPSM. The interest- 
ing observation is that SpAPSM achieves a better performance compared 
to OCCD-TWL, albeit at significantly lower complexity. If on the other 
hand complexity is of major concern, as it has already been pointed out in 
12] use of SpAPSM offers the flexibility to use GT operators, which lead 
to improved performance for small values of q at complexity comparable to 
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that of LMS-based sparsity promoting algorithms Kops 12a| . 



14 Learning Sparse Analysis Models 

All our discussion, so far, has been exhausted in the terrain of signals which 
are either sparse themselves or they can be sparsely represented in terms of 



the atoms of a dictionary in a synthesis model, as introduced in (23), i.e 



As a matter of fact, most of the research activity over the last decade or 
so has been focused on the synthesis model. This may be partly due to 
the fact that the synthesis modeling path may provide a more intuitively 
appealing structure to describe the generation of the signal in terms of the 



elements (atoms) of a dictionary. Recall from Section 10 that the sparsity 



assumption was imposed on in the synthesis model and the corresponding 



optimization task was formulated in (45) and (46) for the exact and noisy 
cases, respectively. 

However, this is not the only way to attack the task of sparsity modeling. 
Very early in this paper, in Section [5j we referred to the analysis model, 

S = $ H s 

and pointed out that in a number of real life applications the resulting 
transform S is sparse. To be fair, for such an experimental evidence, the 
most orthodox way to deal with the underlying model sparsity would be 
to consider ||<3?^s|| . Thus, if one wants to estimate s, a very natural way 
would be to cast the related optimization task as 

min 11$ sll , 

s.t. y = Xs, or \\y - Xs\\l< e, (70) 

depending on whether the measurements via a sensing matrix, X, are exact 
or noisy. Strictly speaking, the total variation minimization approach, which 
was used in Example [6j falls under this analysis model formulation umbrella, 
since what is minimized is the t\ norm of the gradient transform of the image. 



The optimization tasks in either of the two formulations given in (70) 
build around the assumption that the signal of interest has sparse analy- 
sis representation. The obvious question that is now raised is whether the 



optimization tasks in (70) and their counterparts in (45) or (46) are any dif- 



ferent. One of the first efforts to shed light in this problem was in |Elad 07b 



There, it is pointed out that the two tasks, although related, yet they are 
in general different. Moreover, their comparative performance depends on 
the specific problem at hand. However, it is fair to say that this is a new 
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field of research and more definite conclusions are currently being shaped. 
An easy answer can be obtained for the case where the involved dictionary 
corresponds to an orthonormal transformation matrix, e.g., DFT. In this 
case, we already know that the analysis and synthesis matrices are related 
as 



$ = $ = \J/ 



-n 



which leads to an equivalence between the two previously stated formula- 
tions. Indeed, for such a transform we have 



$> H s 



, s = $S. 

Analysis Synthesis 



Using the last formula into the (70), the tasks in (45) or (46) are readily 



obtained by replacing 6 by s. However, this reasoning cannot be carried out 
to the case of overcomplete dictionaries; it is for these cases, where the two 
optimization tasks may lead to different solutions. 

The previous discussion, concerning the comparative performance be- 
tween the synthesis or analysis-based sparse representations, is not only of 
a "philosophical" value. It turns out that, often in practice, the nature of 
certain overcomplete dictionaries does not permit the use of the synthesis 
based formulation. These are the cases where the columns of the overcom- 
plete dictionary exhibit high degree of dependence; that is, the coherence 



of the matrix, as defined in section 7.1, has large values. Typical examples 
of such overcomplete dictionaries are the Gabor frames, the curvelet frames 
and the oversampled DFT. The use of such dictionaries lead to enhanced 
performance in a number of applications, e.g., |Star 02 , Star 07] . Take as 



an example the case of our familiar DFT transform. This transform pro- 
vides a representation of our signal samples in terms of sampled exponential 
sinusoids, whose frequencies are multiples of where T is the sampling 
frequency and IT is the length of our signal segment s; that is, 



s :- 



S (0) 
s(T) 

s((l-l)T) 



i-i 



(71) 



i=0 



where Si are the DFT coefficients and tpi is the sampled sinusoid with fre- 
quency equal to j£ri, i.e., 



1 



exp (-jjjtiT) 



exp - 1)T) 



(72) 
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However, this is not necessarily the most efficient representation. For ex- 
ample, it is highly unlikely that a signal comprises only frequencies which 
are multiples of the basic one; only such signals can result in a sparse rep- 
resentation using the DFT basis. Most probably, in general, there will be 
frequencies lying in between the frequency samples of the DFT basis, which 
result in non-sparse representations. This can be remedied by increasing 
the number I of the atoms, and form a dictionary that involves sinusoids 
with frequencies taken at smaller frequency intervals. However, in such a 
dictionary the atoms are no more linearly independent and the coherence of 
the respective (dictionary) matrix increases. 

Once a dictionary exhibits high coherence, then there is no way of finding 
a sensing matrix, X, so that X^> to obey the RIP. Recall that at the heart 
of the sparsity-aware learning lies the concept of stable embedding, that 
allows the recovery of a vector/signal after projecting it in a lower dimen- 
sional space; this is what all the available conditions, e.g., RIP, guarantee. 
However, no stable embedding is possible with highly coherent dictionar- 
ies. Take as an extreme example the case where the first and second atoms 
are identical. Then no sensing matrix X can achieve a signal recovery that 
distinguishes the vector [1,0,..., 0] T from [0, 1,0,..., 0] T . Can then one con- 
clude that for highly coherent overcomplete dictionaries compressed sensing 
techniques are not possible? Fortunately, the answer to this is negative. Af- 
ter all, our goal in compressed sensing has always been the recovery of the 
signal s = ^>9 and not the identification of the sparse vector in the synthe- 
sis model representation. The latter was just a means to an end. While the 
unique recovery of 6 cannot be guaranteed for highly coherent dictionaries, 
this does not necessarily cause any problems for the recovery of s, using a 
small set of measurement samples. The escape route will come by consid- 
ering the analysis model formulation. However, prior to this treatment, it 
will be of no harm to refresh our basics concerning the theory of frames and 
recall some key definitions. 

14.1 Some Hints from the Theory of Frames 

In order to remain in the same framework as the one already adopted for 
this paper and comply with the notation previously used, we will adhere 
to the real data case, although everything we are going to say is readily 
extended to the complex case, by replacing transposition with its Hermitian 
counterpart. 

A frame in a vector spac^] V C V) is a generalization of the notion 
of a basis. Recall form our linear algebra basics that a basis is a set of 
vectors ipi, i G I, with the following two properties: a) V = spanj^j : 
i G X}, where the cardinality card(X) = I and b) ■j/jj, i 6 I, are mutually 



We constrain our discussion in this section to finite dimensional Euclidean spaces. 
The theory of frames has been developed for general Hilbert spaces. 
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independent. If, in addition, = 5ij then the basis is known as 

orthonormal. If we now relax the second condition and allow I < card(X) := 
p, we introduce redundancy in the signal representations, which, as it has 
already been mentioned, can offer a number of advantages in a wide range 
of applications. However, once redundancy is introduced we lose uniqueness 
in the signal representation 

a = J^0<Vi, (73) 

due to the dependency among the vectors ipi. The question that is now 
raised is whether there is a simple and systematic way to compute the coef- 
ficients 9i in the previous expansion. 

Definition 4. The set ifti, i E I, which spans a vector space, V, is called 
a frame if there exist positive real numbers, A and B, such that for any 
non-zero s G V, 

0<A \\s\\l < | (ipi, s)\ 2 <B \\s\\l , (74) 

where A and B are known as the bounds of the frame. 

Note that if ifti, i 6 X, comprise an orthonormal basis, then A = B = 1 



and (74) is the celebrated Parseval's theorem. Thus, (74) can be considered 
as a generalization of Parseval's theorem. Looking at it more carefully, we 
notice that this is a stability condition that closely resembles our familiar 
RIP condition in ( |36| ). Indeed, the upper bound guarantees that the ex- 
pansion never diverges (this applies to infinite dimensional spaces) and the 
lower bound guarantees that no non-zero vector, ^ 0, will ever become 
zero after projecting it along the atoms of the frame. To look at it from a 
slightly different perspective, form the dictionary matrix 

where we used p to denote the cardinality of X. Then, the lower bound 



in (74) guarantees that s can be reconstructed from its transform samples 
^ T s; note that in such a case, if si ^ 82, then their respective transform 
values will be different. 



It can be shown that if condition ( 74 ) is valid, then there exists another 



set of vectors, tpi, i S X, known as the dual frame, with the following elegant 
property 

s = Y J (^ s )^i = 5^(^i,*>^t, VseV. (75) 

iex iex 

Once a dual frame is available, the coefficients in the expansion of a vector 
in terms of the atoms of a frame are easily obtained. If we form the matrix 
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^ of the dual frame vectors, then it is easily checked out that since condition 
(75) is true for any s, it implies that 

^ T = ^x]/ r = J, (76) 

where / is the I x I identity matrix. Note that all of us have used the 
property in (75 ), possibly in a disguised form, many times in our professional 
life. Indeed, consider the simple case of two independent vectors in the two- 
dimensional space (in order to make things simple). Then, (73) becomes 

S = 9i<lpi + 2 '02 = ^0. 

Solving for the unknown 6 is nothing but the solution of a linear set of 
equations; note that the involved matrix ^ is invertible. Let us rephrase a 
bit our familiar solution 



= ^-! s := § s : = 



1>{ 

r 4 



T 



1# 



S. 



(77) 



where iftf, i = 1,2, are the rows of the inverse matrix. Using now the 
previous notation, it is readily seen that 



1pl,S) Vl + (V>2, S) lf> 2 - 



Moreover, note that in this special case of independent vectors, the respec- 
tive definitions imply 



4 T 



[•01,^2 



and the dual frame is not only unique but it also fulfils the biorthogonality 
condition, i.e., 



^,•05 



(78) 



In the case of a general frame, the dual frames are neither biorthogonal nor 
uniquely defined. The latter can also be verified by the condition ( 76 ) that 
defines the respective matrices. ^> T is a rectangular tall matrix and its left 
inverse is not unique. There is, however, a uniquely defined dual frame, 
known as the canonical dual frame, given as 



iPi ■- (^ T )- 1 tp u or $ := (tf**') -1 *. 



T\-U 



(79) 



Another family of frames of special type are the so-called tight frames. 
For tight frames, the two bounds in (74) are equal, i.e., A = B. Thus, once 
a tight frame is available, we can normalize each vector in the frame as 



A 
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which then results to the so-called Parseval tight frame; the condition (74) 
now becomes similar in appearance with our familiar Parseval's theorem for 
orthonormal bases 



■iex 



(80) 



Moreover, it can be shown that a Parseval tight frame coincides with its 
canonical dual frame (that is, it is self dual) and we can write 



or in matrix form 



* = (81) 

which is similar with what we know for orthonormal bases; however in this 
case, orthogonality does not hold in general. 

We will conclude this subsection with a simple example of a Parseval 
(tight) frame, known as the Mercedes Benz (MB), 
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1 -i 
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V6 J 



One can easily check that all the properties of a Parseval tight frame are 
fulfilled. If constructing a frame, especially in high dimensional spaces, may 
sound a bit like magic, the following theorem (due to Naimark, see, e.g., 
Han 00 1) offers a systematic way for such constructions. 

Theorem 6. A set {i/'ijiex m a Hilbert space % s is a Parseval tight frame, 
if and only if it can be obtained via orthogonal projection, P^ s : H — > T~l s , of 
an orthonormal basis {ej}i £ j in a larger Hilbert space T~L, such that T~L S C %. 

To verify the theorem, check that the MB frame is obtained by orthog- 
onally projecting the three-dimensional orthonormal basis 



ei = 


" 

l 

f 


, e 2 = 


Pi 

V 3 

1 

f 

. V6 . 


, e 3 = 


- l - 

f 
f 










L V3 J 



using the projection matrix 



l l 



"? J 

3 3 



3 J 



Observe that the effect of the projection 

pH s [ei,e 2 ,e 3 ] = 
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is the deletion of the last column in the basis matrix. 

Frames were introduced by Duffin and Schaeffer in their study on non- 



harmonic Fourier series in 1952 |Duff 52 and they remained rather obscured 



till they were used in the context of wavelet theory, e.g., Daub 86 . The 
interested reader can obtain the proofs of what has been said in this section 
from these references. An introductory review with a lot of engineering fla- 



vor can be found in Kova 07 , where the major references in the field are 
given. 

14.2 Compressed Sensing for Signals Sparse in Coherent Dic- 
tionaries 

Our goal in this subsection is to establish conditions that guarantee recovery 
of a signal vector, which accepts a sparse representation in a redundant and 
coherent dictionary, using a small number of signal-related measurements. 
Let the dictionary at hand be a tight frame, Then, our signal vector is 
written as 

s = W, (82) 

where 6 is assumed to be /c-sparse. Recalling the properties of a tight frame, 
as they were summarized in the previous subsection, the coefficients in the 



expansion (82) can be written as (ij)i,s), and the respective vector as 

e = # T s, 

since a tight frame is self dual. Then, the analysis counterpart of the syn- 



thesis formulation in (46) can be cast as 



II T 1 1 1 

min V s L , 

s " 111 

s.t. \\y - Xs\\l<e. (83) 

The goal now is to investigate the accuracy of the recovered solution to this 
convex optimization task. It turns out that similar strong theorems are also 
valid for this problem as with the case of the synthesis formulation, which 
was studied earlier. 

Definition 5. Let be the union of all subspaces spanned by all subsets 
of k columns of A sensing matrix, X, obeys the restricted isometry 
property adapted to ty, (^-RIP) with 5k, if 

(1 - 6 k ) \\s\\ 2 2 < \\Xs\\ 2 2 < (1 + 5 k ) \\s\\l , (84) 



for all s G £ 



k ■ 



The union of subspaces, is the image under of all /c-sparse vectors. 
This is the difference with the RIP definition given in Section 8.2 All the 
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random matrices discussed earlier in this paper can be shown to satisfy 
this form of RIP, with overwhelming probability, provided the number of 
measurements, N, is at least of the order of kln(l/k). We are now ready to 
establish the main theorem concerning our l\ minimization task. 

Theorem 7. Let \& be an arbitrary tight frame and X a sensing matrix that 
satisfies the ^-RIP with 82k < 0.08, for some positive k. Then the solution, 



s*, of the minimization task in (83) satisfies the following property 

\\s - s*|| 2 < C £ri \\^ T s - (^ T s) k \\ l + Ciy/e, (85) 

where Cq,C\ are constants depending on 62k, (^ T s) k denotes the best k- 
sparse approximation of ty T s; i.e., it results by setting all but the k largest 



in magnitude components of ^> T s equal to zero. 



The bound in (85) is the counterpart of that given in (43). In other 
words, the previous theorem states that if *f! T s decays rapidly, then s can 
be reconstructed from just a few (compared to the signal length I) measure- 



ments. The theorem was first given in [Cand 11a and it is the first time that 



such a theorem provides results for the sparse analysis model formulation in 
a general context. 

14.3 Cosparsity 



In Nam 121, the task of sparse analysis modeling was approached via an 



alternative route, employing the tools which were developed in Lu 08b 



Blum 09b for treating general union-of-subspaces models. This comple- 
mentary point of view will also unveil different aspects of the problem by 
contributing to its deeper understanding. We have done it before, where the 
notions of spark, coherence and RIP were all mobilized to shed light from 
different corners to the sparse synthesis modeling task. 

In the sparse synthesis formulation, one searches for a solution in a union 
of subspaces, which are formed by all possible combinations of k columns 
of the dictionary, vp. Our signal vector lies in one of these subspaces; the 
one which is spanned by the columns of VP whose indices lie in the support 



set (Section 11.1). In the sparse analysis approach things get different. The 
kick off point is the sparsity of the transform S ■= § T s, where <3? defines 
the transformation matrix or analysis operator. Since S is assumed to be 
sparse, there exists an index set X such that Mi S X, Si = 0. In other words, 
Vi £ X, <fijs := ((pi, s) = 0, where <j>i stands for the ith column of <£. Hence, 
the subspace in which s lives is the orthogonal complement of the subspace 
formed by those columns of which correspond to a zero in the transform 
vector S. Assume, now, that card(X) = C . The signal, s, can be identified 
by searching the orthogonal complements of the subspaces formed by all 
possible combinations of C Q columns of <£, i.e., 

(0i,a) = O, VfeX. 
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(a) (b) 



Figure 24: Searching for a spare vector s. (a) In the synthesis model, the 
sparse vector lies in subspaces formed by combinations of k (in this case 
k = 2) columns of the dictionary ^. (b) In the analysis model, the sparse 
vector lies in the orthogonal compliment of the subspace formed by C a (in 
this case C Q = 2) columns of the transformation matrix <£. 



The difference between the synthesis and analysis problems is illustrated 



in Fig. 24. To facilitate the theoretical treatment of this new setting, the 



notion of cosparsity was introduced in Nam 121. 



Definition 6. The cosparsity of a signal s £ 1Z 1 with respect to a p x I 
matrix <]? T is defined as 

C o :=p-\\$ T s\\ . (86) 

In words, the cosparsity is the number of zeros in the obtained trans- 
form vector S = & T s; in contrast, the sparsity measures the number of the 
nonzero elements of the respective sparse vector. If one assumes that has 
"full spark'Q i.e., I + 1, then any I of the columns of <£, and thus any I 
rows of $> T are guaranteed to be independent. This indicates that for such 
matrices, the maximum value that cosparsity can take is equal to C = I — 1. 
Otherwise, the existence of I zeros will necessarily correspond to a zero sig- 
nal vector. Higher cosparsity levels are possible, by relaxing the full spark 
requirement. 

Let now the cosparsity of our signal with respect to a matrix <3? T be C . 
Then, in order to dig out the signal from the subspace in which is hidden, 
one must form all possible combinations of C Q columns of $ and search in 
their orthogonal complements. In case that $ is full rank, we have seen 



7 Recall by Def. [2] that spark($) is defined for an I x p matrix "I> with p > I and of full 
rank. 
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previously that C Q < I, and hence any set of C a columns of $ are linearly 
independent. In other words, the dimension of the span of those columns 
is C Q . As a result, the dimensionality of the orthogonal complement, into 
which we search for s, is I — C Q . 

We have by now accumulated enough information to elaborate a bit more 
on the statement made before, concerning the different nature of the syn- 
thesis and analysis tasks. Let us consider a synthesis task using an I x p 
dictionary and let k be the sparsity level in the corresponding expansion of 
a signal in terms of this dictionary. The dimensionality of the subspaces in 
which the solution is sought is k (k is assumed to be less than the spark 
of the respective matrix). Let us keep the same dimensionality for the sub- 
spaces in which we are going to search for a solution in an analysis task. 
Hence, in this case C Q = I — k (assuming a full spark matrix). Also, for 
the sake of comparison assume that the analysis matrix is p X I. Solving 
the synthesis task, one has to search (^) subspaces, while solving the anal- 
ysis task one has to search for subspaces. These are two different 
numbers; assuming that k <C I and also that / < p/2, which are natural 
assumptions for overcomplete dictionaries, then the latter of the two num- 
bers is much larger than the former one (use your computer to play with 
some typical values). In other words, there are much more analysis than 
synthesis low-dimensional subspaces to be searched for. The large number 
of low-dimensional subspaces makes the algorithmic recovery of a solution 
from the analysis model a tougher task, [Nam 12 . 



Another interesting aspect that highlights the difference between the 
two approaches is the following. Assume that the synthesis and analysis 
matrices are related as <3? = ^f, as it was the case for tight frames. Under 
this assumption, provides a set of coefficients for the synthesis expansion 
in terms of the atoms of # = ^. Moreover, if ||<I> r s|| = k, then the 
is a possible fc-sparse solution for synthesis model. However, there is no 
guarantee that this is the sparsest one. 

It is now the time to investigate whether conditions that guarantee 
uniqueness of the solution for the sparse analysis formulation can be de- 
rived. The answer is affirmative and it has been established in |Nam 12 , for 
the case of exact measurements. 

Lemma 5. Let <!> be a transformation matrix of full spark. Then, for almost 
all N x I sensing matrices and for N > 2(1 — C Q ), the equation 

y = Xs, 

has at most one solution with cosparsity at least C Q . 



The above lemma guarantees the uniqueness of the solution, if one exists, 
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of the following optimization 



mm 

s 
S.t. 



l$ T s| 



y = Xs. 



(87) 



However, solving the previous Iq minimization task is a difficult one and 
we know that its synthesis counterpart has been shown to be NP-hard, in 
general. Its relaxed convex relative is the l\ minimization 



mm 

s 
S.t. 



y 



i 

Xs. 



In Nam 12 , conditions are derived that guarantee the equivalence of the Iq 



and l\ tasks, in (87) and (88), respectively; this is done in a way similar to 



that for the sparse synthesis modeling. Also, in |Nam 12 , a greedy algorithm 
inspired by the Orthogonal Matching Pursuit, discussed in Section [ll.l[ has 
been derived. Other algorithms that solve the l\ optimization in the analysis 
modeling framework can be found in, e.g., [Cai 09aplad05|Sele 09] . NESTA 
can also be used for the analysis formulation. 



15 A Case Study: Time- Frequency Analysis 

The goal of this section is to demonstrate how all the previously stated the- 
oretical findings can be exploited in the context of a real application. Sparse 
modelling has been applied to almost everything. So, picking up a typical 
application would not be easy. We preferred to focus on a less "publicised" 
application; that of analysing echolocation signals emitted by bats. How- 
ever, the analysis will take place within the framework of time-frequency 
representation, which is one of the research areas that significantly inspired 
the evolution of compressed sensing theory. Time- Frequency analysis of sig- 
nals has been the field of intense research for a number of decades, and it 
is one of the most powerful signal processing tools. Typical applications in- 
clude speech processing, sonar sounding, communications, biological signals, 
EEG processing, to name but a few, see, e.g., |Boas 03 , Flan 99| . 



15.0.1 Gabor Transform and Frames 

It is not our intention to present the theory behind the Gabor transform. 
Our goal is to outline some basic related notions and use it as a vehicle for 
the less familiar reader so that a) to better understand how redundant dictio- 
naries are used and b) get more acquainted with their potential performance 
benefits. 

The Gabor transform was introduced in the middle 1940s by Dennis Ga- 
bor (1900-1979), who was a Hungarian-British engineer. His most notable 
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(a) 





Time (samples) 

(b) 

Figure 25: (a) The Gaussian window with spreading factor a centered at 
time instant m. (b) Pulses obtained by windowing three different sinusoids 
with Gaussian windows of different spread and applied at different time 
instants. 



scientific achievement was the invention of holography, for which he won the 
Nobel prize for Physics in 1971. The discrete version of the Gabor transform 
can be seen as a special case of the Short Time Fourier Transform (STFT), 
e.g., |Mall 08 Flan 99 . In the standard DFT transform, the full length of a 



time sequence, comprising I samples, is used all in "one-go" in order to com- 
pute the corresponding frequency content. However, the latter can be time 
varying, so the DFT will provide an average information, which cannot be of 
much use. The Gabor transform (and the STFT in general) introduces time 
localization via the use of a window function, which slides along the signal 
segment in time, and at each time instant focusses on a different part of the 
signal; this is a way that allows one to follow the slow time variations, which 
take place in the frequency domain. The time localization in the context of 
the Gabor transform is achieved via a Gaussian window function, i.e., 

n 2 \ 



Fig. 25 1 shows the Gaussian window, g(n — m), centered at time instant m. 
The choice of the window spreading factor, cr, will be discussed later on. 
Let us now construct the atoms of the Gabor dictionary. Recall that 



in the case of the signal representation in terms of the DFT in ( 71 ) , each 



frequency is represented only once, by the corresponding sampled sinusoid, 



(72). In the Gabor transform, each frequency appears I times; the corre- 
sponding sampled sinusoid is multiplied by the Gaussian window sequence, 
each time shifted by one sample. Thus, at the zth frequency bin, we have I 
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Time 

Figure 26: Each atom of the Gabor dictionary corresponds to a node in 
the time-frequency grid. That is, it is a sampled windowed sinusoid whose 
frequency and location in time are given by the coordinates of the respective 
node. In practice, this grid may be subsampled by factors a and j3 for the 
two axes respectively, in order to reduce the number of the involved atoms. 



atoms, g( m,% \m = 0, 1, — 1, with elements given by 

g^(n)=g(n-m)Mn), n, m, i = 0, 1, . . . , I - 1, (90) 



where ipi{ n ) ls the nth element of the vector ipi in (72). This results to 
an overcomplete dictionary comprising I 2 atoms in the Z-dimensional space. 
Fig. [25b illustrates the effect of multiplying different sinusoids with Gaussian 



pulses of different spread and at different time delays. Fig. 26 is a graphical 
interpretation of the atoms involved in the Gabor dictionary. Each node, 
(m,i), in this time-frequency plot, corresponds to an atom of frequency 
equal to j^i and delay equal to m. 

Note that the windowing of a signal of finite duration inevitably intro- 
duces boundary effects, especially when the delay m gets close to the time 
segment edges, and I — 1. A solution to it, that facilitates the theoretical 
analysis, is to use a modulo I arithmetic to wrap around at the edge points 



(this is equivalent with extending the signal periodically), see, e.g., Stro 98| . 

Once the atoms have been defined, they can be stacked one next to the 
other to form the columns of the I x Z 2 Gabor dictionary, G. It can be shown 
that the Gabor dictionary is a tight frame, |Zibu 94| . 



15. A CASE STUDY: TIME-FREQUENCY ANALYSIS 



98 





Time 



Figure 27: The shorter the width of the pulsed (windowed) sinusoid is in 
time the wider the spread of its frequency content around the frequency of 
the sinusoid. The Gaussian-like curves along the frequency axis indicate the 
energy spread in frequency of the respective pulses. The values of at and aj 
indicate the spread in time and frequency, respectively. 



15.0.2 Time-Frequency Resolution 

By the definition of the Gabor dictionary, it is readily understood that the 
choice of the window spread, as measured by a, must be a critical factor, 
since it controls the localization in time. As it is known from our Fourier 
transform basics, when the pulse becomes short, in order to increase the 
time resolution, its corresponding frequency content spreads out, and vice 
versa. From Heisenberg's principle, we know that we can never achieve high 
time and frequency resolution, simultaneously; one is gained at the expense 
of the other. It is here where the Gaussian shape in the Gabor transform 
is justified. It can be shown that the Gaussian window gives the optimal 



trade-off between time and frequency resolution, [Mall 08 



Flan 99 . The 



time-frequency resolution trade-off is demonstrated in Fig. 27 where three 
sinusoids are shown windowed with different pulse durations. The diagram 
shows the corresponding spread in the time- frequency plot. The value of 
at indicates the time spread and aj the spread of the respective frequency 
content around the basic frequency of each sinusoid. 

15.0.3 Gabor Frames 

In practice, I 2 can take large values and it is desirable to see whether one 
can reduce the number of the involved atoms, without sacrificing the frame- 
related properties. This can be achieved by an appropriate subsampling, as 
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this is illustrated in Fig. [26] , We only keep the atoms that correspond to 
the red nodes. That is, we subsample by keeping every a nodes in time and 
every (3 nodes in frequency in order to form the dictionary, i.e., 

G {a ,f 3) = {g {ma ^ ) }, m = 0,l,...,--l, i = 0,l,...,^-l, 

where a and /? are divisors of /. Then, it can be shown, e.g., ( |Flan 99] ) , that 
if a (3 < I the resulting dictionary retains its frame properties. Once G^ a ^ is 



obtained, the canonical dual frame is readily available via (79) (adjusted for 
complex data), from which the corresponding set of expansion coefficients, 
9, results. 

15.0.4 Time-Frequency Analysis of Echolocation Signals Emitted 
by Bats 

Bats are using echolocation for navigation (flying around at night), for prey 
detection (small insects) and for prey approaching and catching; each bat 
adaptively changes the shape and frequency content of its calls in order to 
better serve the previous tasks. Echolocation is used in a similar way for 
sonars. Bats emit calls as they fly, and "listen" to the returning echoes in 
order to build up a sonic map of their surroundings. In this way, bats can 
infer on the distance, the size of obstacles as well as of other flying crea- 
tures/insects. Moreover, all bats emit special types of calls, called social 
calls, which are used for socializing, flirting, etc. The fundamental charac- 
teristics of the echolocation calls, as for example, the frequency range and 
the average time duration, differ from species to species since, thanks to 
evolution, bats have adapted their calls in order to get best suited to the 
environment in which a species operates. 

Time- Frequency analysis of echolocation calls provides information about 
the species (species identification) as well as of the specific task and be- 
haviour of the bats in certain environments. Moreover, the bat-biosonar 
system is studied in order humans to learn more about nature and be in- 
spired for subsequent advances in applications such as sonar navigation sys- 
tems, radars, medical ultrasonic devices, etc. 



Fig. 28 1 shows a case of a recorded echolocation signal from bats. Zoom- 
ing at two different parts of the signal, we can observe that the frequency is 
changing with time. In Fig. [28p , the DFT of the signal is shown, but there 
is no much information that can be drawn from it, except that the signal 
is compressible in the frequency domain; most of the activity takes place 
within a short range of frequencies. 

Our echolocation signal was a recording of total length T = 21.845msecs, 



|Kops 10 . Samples were taken at the sampling frequency f s = 750 KHz, 



which results in a total of I = 16384 samples. Although the signal itself is 
not sparse in the time domain, we will take advantage of the fact that it is 
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Figure 28: (a) The recorded echo-location signal. The frequency of the signal 
is time varying and this is indicated by focussing on two different parts of the 
signal, (b) Plot of the energy of the DFT transform coefficients, Si. Observe 
that most of the frequency activity takes place within a short frequency 
range. 
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sparse in a transformed domain. We will assume that the signal is sparse in 
its expansion in terms of the Gabor dictionary. 

Our goal in this example is to demonstrate that one does not really need 
all 16384 samples to perform time- frequency analysis; all the processing can 
be carried out by using a reduced number of measurements, by exploiting 
the theory of compressed sensing. To form the measurements vector, y, the 
number of measurements was chosen to be N = 2048. This amounts to a 
reduction of eight times with respect to the number of available samples. 
The measurements vector was formed as 

y = Xs, 

where X is a N x I sensing matrix comprising ±1 generated in a random way. 
This means that once we obtain y, one does not need to store the original 
samples any more, leading to a saving in memory. Ideally, one could have 
obtained the reduced number of measurements by sampling directly the 
analogue signal at sub-Nyquist rates, as it has already been discussed at the 



end of Section 10 Another goal is to use both the analysis and synthesis 
models and demonstrate their difference. 

Three different spectrograms were computed. Two of them, shown in 
Figs. [29p and [29b, correspond to the reconstructed signals obtained by the 



analysis, (88), and the synthesis, (44), formulations, respectively. In both 



cases, the NESTA algorithm was used and the G( 12 s,64) frame was employed. 
Note that the latter dictionary is redundant by a factor of 2. The spectro- 
grams are the result of plotting the time-frequency grid and colouring each 
node (t, i) according to the energy \9\ 2 of the coefficient associated with the 
respective atom in the Gabor dictionary. The full Gabor transform applied 
on the reconstructed signals to obtain the spectrograms, in order to get a 
better coverage of the time-frequency grid. The scale is logarithmic and 
the darker areas correspond to larger values. The spectrogram of the orig- 



inal signal obtained via the full Gabor transform is shown in Fig. 291. It 
is evident, that the analysis model resulted in a more clear spectrogram, 
which resembles the original one better. When the frame G(64,32) is em ~ 
ployed, which is a highly redundant Gabor dictionary comprising 8/ atoms, 
then the analysis model results in a recovered signal whose spectrogram is 
visually indistinguishable from the original one in Fig. [29H. 



Fig. 29 i is the plot of the magnitude of the corresponding Gabor trans- 
form coefficients, sorted in decreasing values. The synthesis model provides 
a sparser representation, in the sense that the coefficients decrease much 
faster. The third curve is the one that results if we multiply the dual frame 
matrix G , ( 12 8,64) directly with the vector of the original signal samples and 
it is shown for comparison reasons. 

To conclude, the curious reader may wonder what do these curves in 



Fig. 291 mean after all. The call denoted by (A) belongs to a Pipistrellus 
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Figure 29: (a) Plot of the magnitude of the coefficients, sorted in decreasing 
order, in the expansion in terms of the G(i28,64) Gabor frame. The results 
correspond to the analysis and synthesis model formulations. The third 
curve corresponds to the case of analysing the original vector signal directly, 
by projecting it on the dual frame, (b) The spectrogram from the analysis 
and (c) the spectrogram from the synthesis formulations, respectively, (d) 
The spectrogram corresponding to (7(64,32) frame using the analysis formu- 
lation. For all cases, the number of measurements used was one eighth of 
the total number of signal samples. A, B and C indicate different parts of 
the signal, as explained in the text. 



pipistrellus (!) and the call denoted by (B) is either a social call or belongs to 
a different species. The (C) is the return echo from the signal (A). The large 
spread in time of (C) indicates a highly reflective environment, |Kops 10 . 

16 From Sparse Vectors to Low Rank Matrices: A 
highlight 

In this section, we move beyond sparse vectors and our goal is to investigate 
if and how notions related to sparsity can be generalized to matrices. We will 
see that such a generalization builds upon linear algebra tools and notions 
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related to SVD decomposition, low rank approximation and dimensionality 
reduction. Our goal is to simply highlight the basic concepts and definitions 
without delving into a deeper treatment. Our aim is to make the reader 
alert of the problems and their potential for applications. 



16.1 Matrix Completion 



Consider a signal vector s £ 1Z 1 , where only N of its components are ob- 
served and the rest are unknown. This is equivalent with sensing s via a 
sensing matrix X having its ./V rows picked uniformly at random from the 
standard (canonical) basis $ = I, where I is the I x I identity matrix. The 
question which is now posed is whether it is possible to recover the missing 
components of s based on these N components. From the theory presented, 
so far, we know that one can recover all the components of s, provided that 
s is sparse in some basis or dictionary, ^f, which exhibits low mutual co- 
herence with <P = I, and N is large enough, as it has been pointed out in 
Section [TUl 

Inspired by the theoretical advances in Compressed Sensing, a question 
similar in flavor and with a prominent impact regarding practical applica- 
tions was posed in Cand 09] . Given a l± X I2 matrix M, assume that only 
N « l\l2 among its entries are known. The question now is whether one 
is able to recover the exact full matrix. This problem is widely known as 
matrix completion iCand 09 . The answer, although it might come as a 



surprise, is "yes" with high probability, provided that a) the matrix is well 
structured, b) it has a low rank, r « I, where / = min(7i,Z2), and c) that 
./V is large enough. Intuitively, this is plausible because a low rank matrix 
is fully described in terms of a number of parameters (degrees of freedom), 
which is much smaller than its total number of entries. These parameters 
are revealed via its Singular Value Decomposition (SVD) 



M 



E 



OiUiV; 



U 



01 











0~r 



V 1 



(91) 



where r is the rank of the matrix, Wj 6 V} 1 and Vi 6 1Z l2 , i = 1, 2, . . . , r, are 
the left and right orthonormal singular vectors, spanning the column and 
row spaces of M respectively, Oi, i = 1, 2, , . . . , r, are the corresponding 
singular values and U = [u\, U2, • • • , u r ], V = [t>i, i>2, • • ■ ,v r ]. 

Let ctm denote the vector containing all the singular values of M, i.e., 
,T , then rank(M) := ||<tjv/|| . Counting the parameters 



\o~i, 02, 



.07 



associated with the singular values and vectors in (91) turns out that the 
number of degrees of freedom of a rank r matrix is equal to oIm = r(h + 
h) — Cand 10a . When r is small, du is much smaller than I. 

Let us denote with fi the set of N pairs of indexes, (i, j), i = 1, ,2, . . . ,li, 
j = 1, , 2, . . . , I2, of the locations of the known entries of M, which have been 
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sampled uniformly at random. Adopting the main reasoning followed so far, 
one would attempt to recover M based on the following rank minimization 
problem 



mm WMo 



s.t. M id = M i:j , (i,j)en. (92) 
It turns out that, assuming that there exist a unique low-rank matrix having 



the specific known entries, then (92) leads to the exact solution Cand 09] . 
However, compared to the case of sparse vectors, in the matrix completion 
problem the uniqueness issue gets much more involved. The following issues 



play a crucial part concerning the uniqueness of the task in (92). 



1. If the number of known entries is lower than the degrees of freedom, 
i.e., ./V < d,M, then there is no way to recover the missing entries 
whatsoever, since there is an infinite number of low rank matrices 
consistent with the N observed entries. 

2. Even if N > cIm, uniqueness is still not guaranteed. It is required 
that the ./V elements with indices in are such that at least one entry 
per column and one entry per row is observed. Otherwise, even a 
rank-1 matrix, i.e, M = o~\Uivf , is not possible to be recovered. This 
becomes clear with a simple example. Assume that M is a rank-1 
matrix and that no entry in the first column as well as in the last row 
is observed. Then, since for this case M(i, j) = o~iuwvij, it is clear that 
no information concerning the first component of v\ as well as the last 
component of u\ is available; hence these singular vector components 
are impossible to be recovered, regardless which method is used. As 
a consequence, the matrix can not be completed. On the other hand, 
if the elements of $7 are picked at random and N is large enough, 
one can only hope that f2 is such that to comply with the previous 
requirement; i.e., at least one entry per row and column is observed, 
with high probability. It turns out that this problem resembles the 
famous in probability theory theorem known as the coupon collector's 
problem. According to this, at least N = CqI In I entries are needed, 



where Co is a constant [IMotw 95 . This is the information theoretic 



limit for exact matrix completion Cand 10a of any low-rank matrix. 



Even if points (1) and (2) before are fulfilled, still uniqueness is not 
guaranteed. In fact, not every low rank matrix is liable to exact com- 
pletion, regardless of the number and the positions of the observed 
entries. We will demonstrate that with the aid of an example. Let 
one of the singular vectors be sparse. Assume, without loss of gen- 
erality, that the third left singular vector, 113, is sparse with sparsity 
level k = 1 and also that its nonzero component is the first one, i.e., 
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u 3i 7^ 0. The rest of U{ and all Uj are assumed to be dense. Let us 
return to the SVD for a while, and specifically to the leftmost formula 



given in (91). Observe that the matrix M is written as the sum of 
r, l\ x ?2 matrices (TiU{vJ ', i = l,...,r. Thus, in this specific case 
where u% is k = 1 sparse, the matrix a^u^v^ has zeros everywhere 
except from its first row. In other words, the information that 1731*31;^ 
brings to the formation of M is concentrated to its first row only. This 
argument can also be viewed from another perspective; the entries of 
M obtained from any row but the first one, do not provide any useful 
information with respect to the values of the free parameters 03, W3, 
V3. As a result, in this case, unless if one incorporates extra informa- 
tion about the sparse nature of the singular vector, the entries from 
the first row that are missed are not recoverable, since the number of 
parameters concerning this row is larger than the available number of 
data. 

Intuitively, when a matrix has dense singular vectors is better rendered 
for exact completion, since each one among the observed entries carries 
information associated with all the du parameters that fully describe 
it. To this end, a number of conditions, which evaluate the suitability 
of the singular vectors, have been established. The simplest one is 
given next [Cand 09] : 

1 = 1,..., r. (93) 

where /i^ is a bound parameter. In fact, /15 is a measure of the 
coherenc^] of matrix U (and similarly of V), (vis-a-vis the standard 
basis), defined as follows: 

u(U) := — max WPueAl 2 , (94) 

r l<i<h 

where Py defines the orthogonal projection to subspace U and ej is 

1 1 2 

the ith vector of the canonical basis. It is easy to show that ||_P{/ej|| = 
||Z7 T e;|| . In essence, coherence is an index quantifying the extent to 
which the singular vectors are correlated with the standard basis, ej, 
% = l,2,...,l. The smaller the is the less "spiky" the singular 
vectors are likely to be, and the corresponding matrix is better suited 
for exact completion. Indeed, assuming for simplicity a square matrix 
M, i.e. l\ = I2 = I, then if any one among the singular vectors 
is sparse having a single nonzero component only, then, taking into 
account that ujxii = vji)i = 1, this value will have magnitude equal 




8 This is a quantity different than the mutual-coherence already discussed in section 

El 
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to one and the bound parameter will take its largest value possible, 
i.e., |Ub = I. On the other hand, the smaller value that fiB can get is 1, 
something that occurs when the components of all the singular vectors 
assume the same value (in magnitude). Note that in this case, due to 
the normalization, this common component value has magnitude \. 
Tighter bounds to the matrix coherence result via the more elaborate 



incoherence property [Cand 09 , Rech 11 and the strong incoherence 



property |Cand 10a . In all cases, the larger the bound parameter is 
the larger the number of known entries, which is required in order to 
guarantee uniqueness, becomes. 



In section |16.3[ the aspects of uniqueness will be discussed in the 
context of a real life application. 



The problem described in ( 92 ) is of limited practical interest since it is 



an NP-hard task. Thus, following the Compressed Sensing paradigm, it is 
replaced by a convexly relaxed counterpart of it, i.e., 



mm 



Mill 



S.t. 



M, 
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M, 



1,3 ■> 



(95) 



where i- e -> the sum of the singular values, is referred to as nu- 



clear norm of the matrix M, often denoted as 



M 



The nuclear norm 



minimization was proposed in Faze 01 as a convex approximation of rank 



minimization, which can be cast as a semidefinite program. 

Theorem 8. Let M be a l\ x li matrix of rank r, which is a constant 
much smaller than I = max(Zi,/2)j obeying (93). Suppose that we observe 
N entries of M with locations sampled uniformly at random. Then there is 
a positive constant C such that if 



N > C/j%lln 2 l, 



(96) 



then M is the unique solution to (95) with probability at leat 1 — / 



;-3 



There might be an ambiguity on how small the rank should be in order 
for the corresponding matrix to be characterized as "low rank". More rig- 
orously, a matrix is said to be of low rank if r = 0(1), which means that 
r is a constant with no dependence (not even logarithmic), on I. Matrix 
completion is also possible for more general rank cases where instead of the 



Cand 09, Cand 10a 



mild coherence property of ( 93 ) , the incoherence and the strong incoherence 
properties 



Rech 11 , Gros 11| are mobilized in order to 



get similar theoretical guaranties. The detailed exposition of these alterna- 
tives is out of the scope of this paper. In fact, Theorem [8] embodies the 
essence of the matrix completion task: with high probability, nuclear-norm 
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minimization recovers all the entries of a low rank matrix M with no er- 
ror. More importantly, the number of entries N, which the convexly relaxed 
problem requires, is only by a logarithmic factor larger than the information 
theoretic limit; recall that the latter is equal to CqI In I. Moreover, similarly 
to Compressed Sensing, robust matrix completion in the present of noise 
is also possible as long as th e request M jj = Mjj in (92) and (95) is re- 
placed by 



.1/ 



< e 



Cand 10b 



Furthermore, the notion of matrix 



completion has also been extended to tensors |Gand 11 Sign ll] . 



16.2 Robust PCA 

The developments on matrix completion theory led, more recently, to the 
formulation and solution of another problem of high significance. To this 
end, the notation H-M^, i.e., the l\ norm of a matrix, is introduced and 
it is defined as the sum of the absolute values of its entries, i.e., II MIL = 



Yli=i Ylj=i In other words, it acts on the matrix as if this were a 

long vector. Assume that M is expressed as the sum of a low rank matrix, 
L, and a sparse matrix, S, i.e., M = L + S . The following convex minimiza- 
tion problem [Cand lfb Wrig 09a, Chan 11 , usually referred to as principal 
component pursuit (PCP), 



mm 



2 

s.t. 



||o"m||i + a \\SWi 
L + S = M, 



(97) 



is shown to recover both L and S according to the following theorem Cand lib] 



Theorem 9. The PCP recovers both L and S with probability at least 
1 — c/j -10 , where c is a constant, provided that: 

1. the support set £1 of S is uniformly distributed among all sets of car- 
dinality N, 

2. the number, k, of nonzero entries of S is relatively small, i.e., k < phh, 
where p is a sufficiently small positive constant, 

3. L obeys the incoherence property, 

4. the regularization parameter, A, is constant with value A = —^=, 

V'2 



5. rank(L) < C £, > with C being a constant. 



In other words, based on all the entries of a matrix M, which is known 
that is the sum of two unknown matrices L and S, with the first one being 
of low rank matrix and the second being sparse, then PCP recovers exactly, 
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with probability almost 1, both L and S, irrespective of how large the mag- 
nitude of the entries of S are, provided that both r and k are sufficiently 
small. 

The applicability of the previous task is very broad. For example, PCP 
can be employed in order to find a low rank approximation of M . It is 
well known that the task of low rank approximation is closely related to 
the dimensionality reduction task, where the columns of M are expressed 
in terms of the r (principal components) columns of U, e.g., Chapter 6, 
|Theo 09 . However, in contrast to the standard SVD or PCA approach, PCP 



is robust and insensitive to the presence of outliers, since these are naturally 
modeled, via the presence of S. For this reason, the above task is widely 
known as robust PCA via nuclear norm minimization. (More classical PCA 
techniques are known to be sensitive to outliers and a number of alternative 
approaches have in the past been proposed towards its robustification, e.g., 
|Karh 95[pCu~95l|Torr 03[|Hube 04[|Hube 05] ) . 



When PCP serves as a robust PCA approach, the matrix of interest is 
L and S accounts for the outliers. However, PCP provides estimates for 
both L and S. As it will be discussed in the next subsection, state-of-the- 
art applications are well accommodated when the focus of interest is turned 
into the sparse matrix S itself. 

Remarks 13. 

• Just as l\ -minimization is the tightest convex relaxation of the combi- 
natorial £o _m i mm i za tion problem in compressed sensing, the nuclear- 
norm minimization is the tightest convex relaxation of the NP-hard 
rank minimization problem; i.e., the nuclear ball {M £ j^h^h ■ \\M\\ < 
1} is the convex hull of the set of rank-one matrices with spectral norm 
bounded by one. Besides the Nuclear norm, other heuristics have also 
been proposed, such as the log-det heuristic [Faze 04| and the max- 
norm 



Foyg 11 



The nuclear norm, as a rank minimization approach, is the general- 
ization of the trace-related cost, which is often used in the control 
community for the rank minimization of positive semidefinite matri- 



ces [Mesb 97 . Indeed, when the matrix is symmetric and positive 
semidefinite, the nuclear norm of M is the sum of the eigenvalues and 
thus it is equal to the trace of M. Such problems arise when, for ex- 
ample, the rank minimization task refers to covariance matrices and 
positive semidefinite Toeplitz or Hankel matrices (see, e.g., |Faze 04) ) . 



Both matrix completion (95) and PCP (97) can be formulated as 
semidefinite programs and are solved via mobilizing interior-point meth- 
ods. However, whenever the size of a matrix becomes large (e.g., 
100 x 100), these methods are deemed to fail in practice due to excessive 
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power and memory requirements. As a result, there is an increasing 
interest, which has propelled intensive research efforts, for the develop- 
ment of efficient methods to solve (95), (97) or related approximations, 
which scale well with large matrices. Many of these methods revolve 
around the philosophy of the iterative soft and hard thresholding tech- 
niques, as discussed in previous sections. However, in the current low 
rank approximation setting, it is the singular values of the estimated 
matrix which are thresholded. As a result, in each iteration, the es- 
timated matrix, after thresholding its singular values, tends to be of 
lower rank. The thesholding of the singular values is either imposed, 
such as in the case of the singular value thresholding (SVT) algorithm 



Cai 10 or it results as a solution of the regularized versions of (95) 



and (j97j) (see, e.g., [Toh 10[pIeTi~TIj|Lin 10[|Gane 09[|Yuan 09| ). More- 
over, algorithms inspired by greedy methods such as CoSaMP, have 
also been proposed |Lee 10 Wate 11| . 

Further developments on robust PCA led to improved versions |Gane 10] 
allowing for exact recovery even though the number of nonzero entries 
of S approaches hfa arbitrarily close, provided that the sign pattern 
of S is random. Furthermore, even full columns are allowed to be 
corrupted |Xu 12 , McCo 11 . Moreover, fusions of PCP with matrix 
completion and Compressed Sensing are possible, in the sense that 
only a subset of the entries of M is available and/or linear measure- 
ments of the matrix in a Compressed Sensing fashion can be used 
instead of matrix entries (see, e.g., |Wate 11 Wrig 12] ). Moreover, 
stable versions of PCP dealing with noise have also been investigated 
(see, e.g., |Zhou 10| ). 



16.3 Applications of Matrix Completion and PCP 

The number of applications in which these techniques are involved is ever 
increasing and their extensive presentation is out of the scope of this paper. 
Next, some key applications will be selectively discussed since they reveal 
the potential of these methods and at the same time will assist the reader 
for a better understanding of the underline notions. 



16.3.1 Matrix Completion 

A typical application, where the matrix completion problem arises, is in the 
collaborative filtering task |Su 09| , which is essential for building up suc- 
cessful recommender systems. Let us consider that a group of individuals 
provide their ratings concerning products, which they have enjoyed. Then 
a matrix with ratings can be rilled where each row indexes a different in- 
dividual and the columns index the products. As a popular example take 
the case where the products are different movies. Inevitably, the associated 
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matrix will be partially filled, since it is not common that all customers 
have watched all the movies and submit ratings for all of them. Matrix 
completion comes to provide an answer, potentially in the affirmative, to 
the following question. Can we predict the ratings that the users would give 
to films that they have not seen yet? This is the task of a recommender 
system in order to encourage users to watch movies, which are likely to be 
of their preference. The exact objective of competition for the famous Net- 
flix prize ( |http : / /www . netf lixprize . com/| ) was the development of such 
a recommender system. 

The aforementioned problem provides a good opportunity to build up our 
intuition about the matrix completion task. Fist, an individual's preferences 
or taste on movies are typically governed by a small number of factors, such 
as the gender, the actors they play in it, the continent of origin, etc. As a 
result, a matrix fully filled with ratings is expected to be low rank. Moreover, 
it is clear that each user need to have at least one movie rated in order to 
have any hope to fill out his/her ratings across all movies. The same is true 
for each movie. This requirement complies with the second requirement in 



section 16.1 concerning uniqueness, i.e., one needs to know at least one 
entry per row and column. Finally, imagine a single user who rates movies 
with criteria that are completely different to those used by the rest of the 
users. He/She could, for example, provide ratings at random or depending 
on, let us say, the first letter of the movie title. Such a scenario complies 
with the third point concerning the uniqueness in the matrix completion 
problem, as previously discussed. Unless all the ratings of the specific user 
are known, the matrix cannot get fully completed. 

In the previous application, the matrix of interest can be characterized 
as approximately low rank. In other cases, such as in sensor network lo- 
calization |Mao 07] , the rank of the matrix assumes an exact value. The 
goal of localization is to assign geographic coordinates to each node in the 
sensor network based on a square matrix, which contains the pairwise dis- 



tances between the nodes |Bisw 06 Mont 10 . It turns out that this matrix 



is of very low rank, e.g., two or three, depending on whether the sensors are 
placed in the 2D or 3D space. As a result, matrix completion is possible 
using a limited number of distance measurements. The number of distance 
measurements is reduced either intentionally, in order to save power and/or 
due to the presence of irregularities and obstacles in the deployment area, 
which renders the communication among nodes impossible. Other applica- 
tions of matrix completion includes system identification [Liu 10 , recovering 



structure from motion |Chen 04] and multi-task learning Argy 07 



16.3.2 Robust PCA/PCP 

In the collaborative filtering task, robust PCA offers an extra attribute com- 
pared to matrix completion, which can be proved very crucial in practice. 
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The users are allowed to even tamper with some of the ratings without af- 
fecting the estimation of the low rank matrix. This seems to be the case 
whenever the rating process involves many individuals in an environment, 
which is not strictly controlled, since some of them occasionally are expected 
to provide ratings in an ad-hoc, or even malicious manner. 

One of the first applications of PCP was in video surveillance systems 
|Cand lib and the main idea behind it appeared to be popular and ex- 
tendable to a number of computer vision applications. Take the example 
of a camera recording a sequence of frames consisting of a merely static 
background and a foreground with few moving objects, e.g., vehicles and/or 
individuals. A common task in surveillance video is to extract from the 
background the foreground, in order, for example, to detect any activity or 
to proceed with further processing such as face recognition. Suppose the 
successive frames are converted to vectors in lexicographic order and then 
are placed as columns in a matrix M. Due to the background, even though 
this may slightly vary due to changes in illumination, successive columns 
are expected to be highly correlated. As a result, the background can be 
modeled as an approximately low rank matrix L. On the other hand, the 
objects on the foreground, appear as "anomalies" concerning a fraction of 
pixels of each frame, i.e., a limited number of entries in each column of M. 
Moreover, due to the motion of the forground objects, the positions of these 
anomalies are likely to change from one column of M to the next. Therefore, 
they can be modeled as a sparse matrix S. Note that in this application, 
the matrix of interest is the sparse matrix rather than the low rank one. 



17 Conclusions 

In this paper, we provided an overview of the major theoretical advances as 
well as the main trends in algorithmic developments in the area of sparsity- 
aware learning and compressed sensing. Both batch processing and online 
processing techniques were considered. A case study in the context of time- 
frequency analysis of signals was also presented. Our intent is to update 
the review from time to time, since this is a very hot research area with a 
momentum and speed that is sometimes difficult to follow up. 
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A Appendix 

The stage of our discussion in this Appendix is the real Euclidean space TZ , 
where I is a positive integer. Although all of the following arguments hold 
true even in the case where the TZ 1 is substituted by the much more general 
Hilbert space setting, we confine ourselves here, for the sake of simplicity, to 
the Euclidean space. Henceforth, the space TZ 1 is considered to be equipped 
with an inner product, which, in the present context, is denoted by (61, 2 ), 
V0i, 62 € TZ 1 - A standard example of such an inner product is the classical 
vector/dot one, defined by (61,62} := 0f0 2 , ¥61,62 G TZ 1 , where the su- 
perscript (■) T stands for vector transposition. Another example of an inner 
product for the space TZ 1 is the following weighted one; (&i, 62) ■= 6JW62, 
¥61, 6 2 G TZ 1 , where W G 7Z lxl is any user-defined positive definite matrix. 
In order not to spare the generality of the following discussion, we let (•, •) 
stand for any user-defined inner product on the linear space TZ 1 . Given such 
an inner product, the associated norm is induced according to the following 
rule: ||-|| := yj (■, •). Excellent resources for a deeper study on the extremely 



rich subject of convex analysis are [Rock 70, Rock 04 Hiri Ol Baus 11 



We start, now, with few notions of fundamental importance to convex 
analysis. 

A.l Closed convex sets and metric projection mappings 

Definition 7 (Convex set, convex function). A non-empty subset C of TZ 1 
is called convex if V0i, 62 G TZ 1 , and VA G [0, 1], the following holds true: 
A0i + (1 - A)0 2 6 (7. 

Moreover, a function C : TZ 1 — > TZ is called convex if V0i,0 2 G TZ 1 , and 
VA G [0, 1], £(A0i + (1 - A)0 2 ) < A£(0i) + (1 - A)£(0 2 ). The function C is 
called strictly convex if VA G (0, 1) and V0i, 62 G TZ 1 , such that 61 ^ 62, we 
have £(A0i + (1 - A)0 2 ) < A£(0i) + (1 - A)£(0 2 ). 

The epigraph of a function £ is defined as the set 

epi(£) := {(0,r) G TZ 1 X TZ : C{6) < r}. 

In other words, the epigraph of C is the set of all points of TZ 1 x TZ which 
belong to and lie above the graph of C. Notice, also, by the definition of 
convexity, that C is convex if and only if epi(£) is convex. 

Given a real number £, the lower level set of C at height £ is defined as 
the set 

lev< ? (£) := {6 G TZ 1 : C(6) < £}. 



For the geometry behind the previous definitions, see Fig. 30 



Definition 8 (The metric projection mapping). Given a non-empty closed 
convex set C C TZ 1 , the metric projection mapping onto C is defined as the 
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A£(0i) + (1 - A)£(0. 




Figure 30: A convex function £, its epigraph, and the lower level set of C 
at height 0. 

operator that maps to each 6 G TZ l the unique Pc{@) £ C such that 

\\0-P c (0)\\ = d(0,C). 

In other words, the point Pc{9) is the unique minimizer of the function 
\\6 — £c||, x G C. Obviously, in the case where G C, then Pc(6) = 

As an example, the metric projection mapping onto the hyperslab is given 
next. 




0. 


all" 







if | (a, 8) - c\ < e 
I 35 *, if (a, 8) < c — e 



Figure 31: A hyperslab and its associated projection mapping. 



Example 8 (Hyperslab). A hyperslab S[e] is the closed convex subset of 
1Z 1 which is defined as 

S[e] := {x eTZ 1 : I (a,x) - c\ < e}, 
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for some nonzero a E 1Z 1 and some c 6 The projection mapping Pgr e 
onto S[e] is given as follows: 



^s [e ](0) = 0- 



0, if | (a,0) - c| < e, 

^g^a, if(a,0)<c-e. 



(98) 



For the related geometry, see Fig. 31 . Notice that a stands for the normal 
vector defining the hyperplanes associated with the hyperslab. 

A.2 The subgradient 



C(0) 




02 



0i 



Figure 32: The graph of C and supporting hyperplanes generated by the 
subgradients at points, 0±, 02, where the function is differentiable and non- 
differentiable, respectively. 



Definition 9 (Subgradient, Subdifferential). Given a convex function 
defined on 1Z 1 , and a point 6 G 1Z l , the subgradient of C at is defined as 
any vector, h, such that 



(h,x-6) +£{0) <£(x)yxeK l 



(99) 



If the function C is differentiable at 6 then the subgradient coincides with 
the (unique) gradient. As it is the case for the gradient, a subgradient defines 
a hyperplane. This hyperplane "supports" the epigraph of that is, the 
epigraph is on the one side of this hyperplane (see Fig. 32). At 0±, the 



convex function is differentiable and there is only one subgradient, which 
coincides with the gradient. Thus at this point, there is a simple hyperplane 
that supports the epigraph. At 02, the function is not differentiable. Hence 
there is an infinity of subgradients that define hyperplanes that support 
the epigraph. The set of all subgradients at a point 8 is known as the 
subdifferential and is denoted as dC, i.e., 

d£(0) := {hell 1 : {h,x-0)+£(x) < C{x),Mx G K 1 }. 
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Next, the subdifferential of C{6) := \9\, 6 G 1Z is given: 
dC(6) = 

where sgn(-) stands for the sign of a real number. For the geometry associ- 




ated to this cost function see Fig. 33 
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